Hello,
We are running an older version of Ceph: 14.2.22 (Nautilus).
We have a radosgw/S3 deployment and had some issues with multipart uploads failing to complete.
We used s3cmd to delete the failed uploads and clean out the bucket, but when reviewing the space utilization of our buckets, it seems this one is still consuming space:
[ ~]# radosgw-admin bucket stats --bucket=BUCKETNAME
{
"bucket": "BUCKETNAME",
"num_shards": 32,
"tenant": "",
"zonegroup": "c73e02d6-d479-4cdc-bf86-8b09f0a9f6ba",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "50ee73bc-bc08-4f9f-9d5b-4492cb4c5e77.1689003.1695",
"marker": "50ee73bc-bc08-4f9f-9d5b-4492cb4c5e77.1689003.1695",
"index_type": "Normal",
"owner": "BUCKETNAME",
"ver": "0#47066,1#30480,2#42797,3#36437,4#47308,5#33285,6#37127,7#24292,8#44567,9#34273,10#29402,11#36228,12#48153,13#32665,14#42314,15#21143,16#34319,17#42818,18#39301,19#23897,20#26225,21#50957,22#39706,23#29723,24#49619,25#44974,26#44020,27#22505,28#46702,29#49390,30#27263,31#21515",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0",
"mtime": "2021-02-08 13:06:13.311932Z",
"max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#",
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 18446744073709551613
},
"rgw.main": {
"size": 34247260247640,
"size_actual": 34247284682752,
"size_utilized": 34247260247640,
"size_kb": 33444590086,
"size_kb_actual": 33444613948,
"size_kb_utilized": 33444590086,
"num_objects": 340627
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
I see under usage.rgw.main that size_kb_actual is 33444613948 KB, or roughly 31 TiB.
When I use the radosgw-admin tool to list objects, I can see many parts left over from failed multipart uploads:
[ ~]# radosgw-admin bucket list --bucket BUCKETNAME | jq '.[] | "\(.name), \(.meta.mtime), \(.meta.size)"'
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-10.tar.gz.2~07YXhKKZn2XYy-6F0itVB4tpuBm1q1J.1, 2021-02-10 00:57:08.033082Z, 4194304"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-10.tar.gz.2~07YXhKKZn2XYy-6F0itVB4tpuBm1q1J.2, 2021-02-10 00:56:36.463099Z, 8794011"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-10.tar.gz.2~b6-C6I3rky3V2Wh4H56jhsfVjvvTMj2.1, 2021-02-10 00:38:44.572199Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-10.tar.gz.2~b6-C6I3rky3V2Wh4H56jhsfVjvvTMj2.2, 2021-02-10 00:38:48.680330Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-10.tar.gz.2~b6-C6I3rky3V2Wh4H56jhsfVjvvTMj2.3, 2021-02-10 00:38:52.232674Z, 95445231"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~R8SwLZMVNM5kL4Ov7sX47mXdEJf0hfu.1, 2021-02-11 00:30:55.489965Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~R8SwLZMVNM5kL4Ov7sX47mXdEJf0hfu.2, 2021-02-11 00:30:58.832752Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~R8SwLZMVNM5kL4Ov7sX47mXdEJf0hfu.3, 2021-02-11 00:31:01.188868Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~R8SwLZMVNM5kL4Ov7sX47mXdEJf0hfu.4, 2021-02-11 00:30:53.035172Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~R8SwLZMVNM5kL4Ov7sX47mXdEJf0hfu.5, 2021-02-11 00:30:21.359861Z, 12448760"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~mPN97GOqO8E93gqVUbt_esJfB4kLu2h.1, 2021-02-11 00:11:52.163319Z, 4194304"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~mPN97GOqO8E93gqVUbt_esJfB4kLu2h.2, 2021-02-11 00:11:48.293292Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~mPN97GOqO8E93gqVUbt_esJfB4kLu2h.3, 2021-02-11 00:11:55.320413Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~mPN97GOqO8E93gqVUbt_esJfB4kLu2h.4, 2021-02-11 00:11:55.039628Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-11.tar.gz.2~mPN97GOqO8E93gqVUbt_esJfB4kLu2h.5, 2021-02-11 00:11:26.493213Z, 2005541"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-12.tar.gz.2~05JmbiZqt8tvgVmJ3Ef6WEzBa3Jla7L.1, 2021-02-12 00:53:24.453273Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-12.tar.gz.2~05JmbiZqt8tvgVmJ3Ef6WEzBa3Jla7L.2, 2021-02-12 00:54:00.743677Z, 9835956"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-12.tar.gz.2~90wJZ6jaWa6BaQC88e9YdXJwsqyme3u.1, 2021-02-12 00:59:24.943370Z, 104857600"
"_multipart_chi-pl-clh-shard-0-0-0-2021-02-12.tar.gz.2~90wJZ6jaWa6BaQC88e9YdXJwsqyme3u.10, 2021-02-12 00:56:56.621609Z, 4194304"
...
However, when I try to delete one of these objects via radosgw-admin, I receive an error that the object is not found:
[ ~]# radosgw-admin object rm --bucket BUCKETNAME --object=_multipart_chi-pl-clh-shard-0-0-0-2021-04-17.tar.gz.2~CVL_xbfGjdDckHe_hpJxoUSynjotOtR.18
ERROR: object remove returned: (2) No such file or directory
When I list objects via the S3 API, none are found:
[ minio-binaries]# ./mc ls BUCKETNAME
[2021-02-08 08:06:13 EST] 0B BUCKETNAME/
[ minio-binaries]# ./mc ls BUCKETNAME/FOLDER
[ minio-binaries]# ./mc ls BUCKETNAME/FOLDER
[ minio-binaries]# ./mc ls --incomplete BUCKETNAME/FOLDER
[ minio-binaries]#
I am wondering: if I were to delete the bucket BUCKETNAME (radosgw-admin bucket rm with the --purge-objects option), would that remove the objects?
Are the objects actually there?
How can I confirm that the objects exist?
And if I delete the bucket, how can I confirm the objects are gone?
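In case it helps, here is what I was planning to try next to answer those questions myself. This is only a sketch: it assumes the default data pool name (default.rgw.buckets.data) and uses the bucket marker from the stats output above.

[ ~]# radosgw-admin object stat --bucket BUCKETNAME --object '_multipart_chi-pl-clh-shard-0-0-0-2021-02-10.tar.gz.2~07YXhKKZn2XYy-6F0itVB4tpuBm1q1J.1'
[ ~]# rados -p default.rgw.buckets.data ls | grep '50ee73bc-bc08-4f9f-9d5b-4492cb4c5e77.1689003.1695' | head
[ ~]# radosgw-admin bucket check --bucket=BUCKETNAME --check-objects

The first command should stat one of the multipart entries through RGW, the second should show whether raw RADOS objects for this bucket still exist, and the third should recheck the bucket index against the actual objects. Does that sound like a sensible approach?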
Any help would be greatly appreciated!
Rhys
Hello!
Yesterday we found errors on some of our cephadm daemons, which are making it impossible to access our HPC cluster:
# ceph health detail
HEALTH_WARN 3 failed cephadm daemon(s); insufficient standby MDS daemons available
[WRN] CEPHADM_FAILED_DAEMON: 3 failed cephadm daemon(s)
daemon mds.cephfs.s1.nvopyf on s1.ceph.infra.ufscar.br is in error state
daemon mds.cephfs.s2.qikxmw on s2.ceph.infra.ufscar.br is in error state
daemon mds.cftv.s2.anybzk on s2.ceph.infra.ufscar.br is in error state
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
have 0; want 1 more
From searching online we found that we should remove the failed MDS daemons, but the data in this cluster is relatively important. We would like to know whether we really need to remove them or whether they can be fixed, and, if we do have to remove them, whether the data will be lost.
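In case it is useful, this is what we were considering running first (a sketch based on the cephadm docs, nothing executed yet):

# ceph orch ps --daemon-type mds
# cephadm logs --name mds.cephfs.s1.nvopyf
# ceph orch daemon restart mds.cephfs.s1.nvopyf

That is, list the daemon states, inspect the logs of one failed daemon on its host, and then try a simple restart. Would a restart be safe with regard to the data? Please tell me if you need more information.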
Thanks in advance,
André de Freitas Smaira
Federal University of São Carlos - UFSCar
Hi everyone,
Our telemetry service is up and running again.
Thanks Adam Kraitman and Dan Mick for restoring the service.
We thank you for your patience and appreciate your contribution to the
project!
Thanks,
Yaarit
On Tue, Jan 3, 2023 at 3:14 PM Yaarit Hatuka <yhatuka(a)redhat.com> wrote:
> Hi everyone,
>
> We are having some infrastructure issues with our telemetry backend, and
> we are working on fixing it.
> Thanks Jan Horacek for opening this issue [1]. We will update once the
> service is back up.
> We are sorry for any inconvenience you may be experiencing, and appreciate
> your patience.
>
> Thanks,
> Yaarit
>
> [1] https://tracker.ceph.com/issues/58371
>
Hi,
Ceph 16 Pacific introduced a new, smaller default min_alloc_size of 4096 bytes for HDD and SSD OSDs.
How can I get the current min_alloc_size of OSDs that were created with older Ceph versions? Is there a command that shows this info from the on-disk format of a BlueStore OSD?
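What I have tried so far, though I am not sure whether it reflects the on-disk value or just the currently configured default:

# ceph osd metadata 0 | grep alloc
# ceph daemon osd.0 config get bluestore_min_alloc_size_hdd

The first shows a bluestore_min_alloc_size field, if the release reports it in the OSD metadata at all; the second only returns the configured option, which is not necessarily what the OSD was actually created with.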
Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Good morning everyone.
Last night we had an incident: someone accidentally renamed the .data pool of a file system, which made it instantly inaccessible. After renaming it back to the correct name it was possible to mount and list the files, but not to read or write: writes returned Read Only, and reads returned Operation not allowed.
After racking my brain for a while I tried mounting with the admin user, and everything worked correctly.
I tried removing the current user's authentication with `ceph auth rm` and created a new user with `ceph fs authorize <fs_name> client.<user> / rw`, but it behaved exactly the same way; I also tried recreating it with `ceph auth get-or-create`, and nothing changed.
After setting `allow *` on mon, mds and osd I was able to mount, read and write again with the new user.
I can understand why the file system stopped working after the pool was renamed; what I don't understand is why users were unable to perform operations on the FS even with rw caps, including newly created users.
What could have happened behind the scenes to prevent I/O even with the correct permissions? Or did I apply incorrect permissions that caused this problem?
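For reference, this is how I compared the users and pools afterwards (a sketch; <user>, <fs_name> and <data_pool> are placeholders):

# ceph auth get client.<user>
# ceph fs status <fs_name>
# ceph osd pool application get <data_pool>

As far as I understand, the OSD cap created by `ceph fs authorize` has the form `allow rw tag cephfs data=<fs_name>`, i.e. it matches the pool's cephfs application tag rather than the pool name, so I don't see how the rename alone would break it.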
Right now everything is working, but I would really like to understand what happened, because I couldn't find anything documented about this type of incident.
Hi,
I am just reading through this document
(https://docs.ceph.com/en/octopus/radosgw/config-ref/) and at the top it states:
> The following settings may be added to the Ceph configuration file (i.e.,
> usually ceph.conf) under the [client.radosgw.{instance-name}] section.
>
And my ceph.conf looks like this:
[client.eu-central-1-s3db3]
> rgw_frontends = beast endpoint=[::]:7482
> rgw_region = eu
> rgw_zone = eu-central-1
>
> [client.eu-central-1-s3db3-old]
> rgw_frontends = beast endpoint=[::]:7480
> rgw_region = eu
> rgw_zone = eu-central-1
>
> [client.eu-customer-1-s3db3]
> rgw_frontends = beast endpoint=[::]:7481
> rgw_region = eu-someother
> rgw_zone = eu-someother-1
>
Do I need to change the section names? It also seems that rgw_region is a
non-existent config value (it might have come from very old RHCS
documentation).
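For what it's worth, this is how I have been checking what the running daemons actually picked up (assuming the default admin socket path for our instance names):

# ceph daemon /var/run/ceph/ceph-client.eu-central-1-s3db3.asok config get rgw_zone
# ceph config help rgw_region

If I understand correctly, the second command should tell me whether rgw_region is a known option at all.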
Would be very nice if someone could help me clarify this.
Cheers and happy weekend
Boris
Hi Xiubo, Randy,
This is due to '<host_ip_address> host.containers.internal' being added to the container's /etc/hosts since Podman 4.1+.
The workaround consists of either downgrading the Podman package to v4.0 (on RHEL 8: dnf downgrade podman-4.0.2-6.module+el8.6.0+14877+f643d2d6) or adding the --no-hosts option to the 'podman run' command in /var/lib/ceph/$(ceph fsid)/iscsi.iscsi.test-iscsi1.xxxxxx/unit.run and restarting the iscsi container service.
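Roughly like this for the --no-hosts variant (a sketch; adjust the fsid and daemon directory to your deployment): locate the 'podman run' line in unit.run, add --no-hosts to it, then restart the unit.

# grep -n 'podman run' /var/lib/ceph/$(ceph fsid)/iscsi.iscsi.test-iscsi1.xxxxxx/unit.run
# systemctl restart ceph-$(ceph fsid)@iscsi.iscsi.test-iscsi1.xxxxxx.service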
[1] and [2] could well have the same cause. The RHCS Block Device Guide [3] quotes RHEL 8.4 as a prerequisite. I don't know which Podman version RHEL 8.4 shipped at the time, but with RHEL 8.7 and Podman 4.2, it's broken.
I'll open a RHCS case today to have it fixed and to have other containers (grafana, prometheus, etc.) checked against this new Podman behavior.
Regards,
Frédéric.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1979449
[2] https://tracker.ceph.com/issues/57018
[3] https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-s…
----- On 21 Nov 22, at 6:45, Xiubo Li xiubli(a)redhat.com wrote:
> On 15/11/2022 23:44, Randy Morgan wrote:
>> You are correct, I am using cephadm to create the iscsi portals. The
>> cluster had been one I was learning a lot with, and I wondered if the
>> problem was due to the number of creations and deletions of things, so I
>> rebuilt the cluster; now I am getting this response even when creating my
>> first iscsi target. Here is the output of gwcli ls:
>>
>> sh-4.4# gwcli ls
>> o- / ............................................................................... [...]
>> o- cluster .............................................................. [Clusters: 1]
>> | o- ceph .............................................................. [HEALTH_WARN]
>> | o- pools ................................................................. [Pools: 8]
>> | | o- .rgw.root ................ [(x3), Commit: 0.00Y/71588776M (0%), Used: 1323b]
>> | | o- cephfs_data .............. [(x3), Commit: 0.00Y/71588776M (0%), Used: 1639b]
>> | | o- cephfs_metadata .......... [(x3), Commit: 0.00Y/71588776M (0%), Used: 3434b]
>> | | o- default.rgw.control ...... [(x3), Commit: 0.00Y/71588776M (0%), Used: 0.00Y]
>> | | o- default.rgw.log .......... [(x3), Commit: 0.00Y/71588776M (0%), Used: 3702b]
>> | | o- default.rgw.meta ......... [(x3), Commit: 0.00Y/71588776M (0%), Used: 382b]
>> | | o- device_health_metrics .... [(x3), Commit: 0.00Y/71588776M (0%), Used: 0.00Y]
>> | | o- rhv-ceph-ssd ............. [(x3), Commit: 0.00Y/7868560896K (0%), Used: 511746b]
>> | o- topology ..................................................... [OSDs: 36,MONs: 3]
>> o- disks ............................................................ [0.00Y, Disks: 0]
>> o- iscsi-targets ................................... [DiscoveryAuth: None, Targets: 1]
>>   o- iqn.2001-07.com.ceph:1668466555428 ..................... [Auth: None, Gateways: 1]
>>     o- disks ........................................................... [Disks: 0]
>>     o- gateways ............................................. [Up: 1/1, Portals: 1]
>>     | o- host.containers.internal ......................... [192.168.105.145 (UP)]
>
> Please manually remove this gateway before doing further steps.
>
> This looks like a bug in cephadm; you can raise a tracker issue for it.
>
> Thanks
>
>
>>     o- host-groups ..................................................... [Groups : 0]
>>     o- hosts .............................................. [Auth: ACL_ENABLED, Hosts: 0]
>> sh-4.4#
>>
>> Randy
>>
>> On 11/9/2022 6:36 PM, Xiubo Li wrote:
>>>
>>> On 10/11/2022 02:21, Randy Morgan wrote:
>>>> I am trying to create a second iscsi target and I keep getting an
>>>> error when I create the second target:
>>>>
>>>>
>>>> Failed to update target 'iqn.2001-07.com.ceph:1667946365517'
>>>>
>>>> disk create/update failed on host.containers.internal. LUN
>>>> allocation failure
>>>>
>>> I think you were using cephadm to add the iscsi targets, not gwcli or
>>> the REST APIs directly.
>>>
>>> The other issues we hit before were login failures, which happened because
>>> there were two gateways using the same IP address. Please share your `gwcli
>>> ls` output so we can see the 'host.containers.internal' gateway's config.
>>>
>>> Thanks!
>>>
>>>
>>>> I am running Ceph Pacific: version 16.2.7
>>>> (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable).
>>>>
>>>> All of the information I can find on this problem is from 3 years
>>>> ago and doesn't seem to apply any more. Does anyone know how to
>>>> correct this problem?
>>>>
>>>> Randy
>>>>
>>>
>>
>
Dear Xiubo,
could you explain how to enable kernel debug logs (I assume this is on the
client)?
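Is it the kernel's dynamic debug interface you mean, i.e. something like the following on the client (just my guess)?

# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
# echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control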
Thanks,
Manuel
On Fri, May 13, 2022 at 9:39 AM Xiubo Li <xiubli(a)redhat.com> wrote:
>
> On 5/12/22 12:06 AM, Stefan Kooman wrote:
> > Hi List,
> >
> > We have quite a few linux kernel clients for CephFS. One of our
> > customers has been running mainline kernels (CentOS 7 elrepo) for the
> > past two years. They started out with 3.x kernels (default CentOS 7),
> > but upgraded to mainline when those kernels would frequently generate
> > MDS warnings like "failing to respond to capability release". That
> > worked fine until 5.14 kernel. 5.14 and up would use a lot of CPU and
> > *way* more bandwidth on CephFS than older kernels (order of
> > magnitude). After the MDS was upgraded from Nautilus to Octopus that
> > behavior is gone (comparable CPU / bandwidth usage as older kernels).
> > However, the newer kernels are now the ones that give "failing to
> > respond to capability release", and worse, clients get evicted
> > (unresponsive as far as the MDS is concerned). Even the latest 5.17
> > kernels have that. No difference is observed between using messenger
> > v1 or v2. MDS version is 15.2.16.
> > Surprisingly the latest stable kernels from CentOS 7 work flawlessly
> > now. Although that is good news, newer operating systems come with
> > newer kernels.
> >
> > Does anyone else observe the same behavior with newish kernel clients?
>
> There are some known bugs which have been fixed, or are being fixed,
> recently, even in mainline, and I am not sure whether they are related,
> such as [1][2][3][4]. For more detail please see the ceph-client repo
> testing branch [5].
>
> I have never seen the "failing to respond to capability release" issue
> myself. If you have the MDS logs (debug_mds = 25 and debug_ms = 1) and
> kernel debug logs, that would help to debug it further; or please provide
> the steps to reproduce it.
>
> [1] https://tracker.ceph.com/issues/55332
> [2] https://tracker.ceph.com/issues/55421
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=2063929
> [4] https://tracker.ceph.com/issues/55377
> [5] https://github.com/ceph/ceph-client/commits/testing
>
> Thanks
>
> -- Xiubo
>
> >
> > Gr. Stefan
> >