Hi everyone,
This is the second release candidate for Reef.
The Reef release comes with a new RocksDB version (7.9.2) [0], which
incorporates several performance improvements and features. Our
internal testing doesn't show any side effects from the new version,
but we are very eager to hear community feedback on it. This is the
first release with the ability to tune RocksDB settings per column
family [1], which allows more granular tuning to be applied to the
different kinds of data stored in RocksDB. A new set of settings is
used in Reef to optimize performance for most kinds of workloads; it
carries a slight penalty in some cases, which is outweighed by large
improvements in compactions and write amplification for use cases such
as RGW. We highly encourage community members to try these settings
against their own performance benchmarks and use cases. The detailed
list of RocksDB and BlueStore changes can be found in
https://pad.ceph.com/p/reef-rc-relnotes.
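For anyone who wants to compare the new defaults with their current
settings, a minimal way to inspect (and, if needed, override) them is
sketched below. This is only an illustration: bluestore_rocksdb_cf,
bluestore_rocksdb_cfs and bluestore_rocksdb_options are the column-family
related settings as we understand them, the example value is made up, and
the authoritative defaults are in the pad above.

  # show the column-family sharding and RocksDB options currently in effect
  ceph config get osd bluestore_rocksdb_cf
  ceph config get osd bluestore_rocksdb_cfs
  ceph config get osd bluestore_rocksdb_options

  # example only: override the global RocksDB options for all OSDs
  ceph config set osd bluestore_rocksdb_options "compression=kNoCompression,max_write_buffer_number=64"

Note that, as far as we know, the column-family layout itself is applied
when an OSD is provisioned; existing OSDs can be resharded offline with
ceph-bluestore-tool.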
If any of our community members would like to help us with performance
investigations or regression testing of the Reef release candidate,
please feel free to provide feedback via email or in
https://pad.ceph.com/p/reef_scale_testing. For more active
discussions, please use the #ceph-at-scale Slack channel on
ceph-storage.slack.com.
This RC has gone through only partial testing due to issues we are
experiencing in the Sepia lab.
Please try it out and report any issues you encounter. Happy testing!
Thanks,
YuriW
Dear Ceph community,
We want to restructure (i.e. move around) a lot of data (hundreds of
terabytes) in our CephFS.
And now I was wondering what happens within snapshots when I move data
around within a snapshotted folder.
I.e., do I need to account for a lot of increased storage usage due to
older snapshots differing from the new, restructured state?
In the end, these are just metadata changes. Are the snapshots aware of this?
Consider the following examples.
Copying data:
Let's say I have a folder /test, with a file XYZ in sub-folder
/test/sub1 and an empty sub-folder /test/sub2.
I create snapshot snapA in /test/.snap, copy XYZ to sub-folder
/test/sub2, delete it from /test/sub1 and create another snapshot snapB.
I would then have two snapshots, each with a distinct copy of XYZ, hence
using double the space in the FS:
/test/.snap/snapA/sub1/XYZ <-- copy 1
/test/.snap/snapA/sub2/
/test/.snap/snapB/sub1/
/test/.snap/snapB/sub2/XYZ <-- copy 2
Moving data:
Let's assume the same structure.
But now, after creating snapshot snapA, I move XYZ to sub-folder
/test/sub2 and then create the second snapshot snapB.
The directory tree will look the same. But how is this treated internally?
Once I move the data, will an actual copy be created in snapA to
represent the old state?
Or will it still reference the same data (like a link to the inode), and
hence not double the storage used for that file?
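For what it's worth, here is roughly how I intend to test this on a small
scale before touching the real data. Just a sketch, assuming the usual
mkdir-in-.snap way of creating snapshots and that the recursive statistics
and pool usage reflect what I'm after:

  mkdir -p /test/sub1 /test/sub2
  dd if=/dev/urandom of=/test/sub1/XYZ bs=1M count=1024

  mkdir /test/.snap/snapA          # snapshot the old state
  mv /test/sub1/XYZ /test/sub2/XYZ
  mkdir /test/.snap/snapB          # snapshot the new state

  # compare recursive usage and pool usage before and after
  getfattr -n ceph.dir.rbytes /test
  ceph df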
I couldn't find (or understand) anything related to this in the docs.
The closest seems to be the hard-link section here:
https://docs.ceph.com/en/quincy/dev/cephfs-snapshots/#hard-links
which unfortunately goes a bit over my head, so I'm not sure whether it
answers my question.
Thank you all for your help. Appreciate it.
Best Wishes,
Mathias Kuhring
Hello Ceph-Users,
The context or motivation of my question is S3 bucket policies and other
cases that use the client's source IP address as a condition.
I was wondering if and how RadosGW is able to determine the source IP
address of clients when it receives their connections via a load balancer /
reverse proxy such as HAProxy.
In that case the connection naturally originates from the proxy, rendering
a policy based on client IP addresses useless.
Depending on whether the connection is balanced as HTTP or TCP, there are
two ways to carry information about the actual source:
* In case of HTTP, via headers like "X-Forwarded-For". This is apparently
supported only for logging the source in the "rgw ops log" ([1])? Or is
this info also used when evaluating the source IP condition within a
bucket policy? (See the config sketch after this list.)
* In case of TCP load balancing, there is the PROXY protocol v2. This
unfortunately does not even seem to be supported by the Beast library
which RGW uses.
I opened feature requests ...
** https://tracker.ceph.com/issues/59422
** https://github.com/chriskohlhoff/asio/issues/1091
** https://github.com/boostorg/beast/issues/2484
but there is no outcome yet.
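For completeness, this is the kind of setup I have in mind for the HTTP
case. A sketch only: rgw_remote_addr_param is the option from [1], the
HAProxy backend name and address are made up, and whether bucket policies
actually evaluate this value is exactly my question.

  # HAProxy (HTTP mode): add the client address as X-Forwarded-For
  backend rgw
      option forwardfor
      server rgw1 192.0.2.10:8080 check

  # tell RGW to take the client address from that header
  ceph config set client.rgw rgw_remote_addr_param http_x_forwarded_for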
Regards
Christian
[1]
https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_remote_addr…
Hi folks,
In a multisite environment, we can have one realm that contains multiple
zonegroups, each of which can in turn have multiple zones. However, the
purpose of a zonegroup isn't clear to me. It seems that when a user is
created, its metadata is synced to all zones within the same realm,
regardless of whether they are in different zonegroups or not. The same
happens to buckets. So what is the purpose of having zonegroups? Wouldn't
it be easier to just have realms and zones?
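To make the question a bit more concrete, this is roughly what I have been
comparing. A sketch only, with made-up zonegroup and zone names:

  radosgw-admin zonegroup list
  radosgw-admin zonegroup get --rgw-zonegroup=zg1
  radosgw-admin zone get --rgw-zone=zone-a

  # data sync relationships seem to be configured per zone within a
  # zonegroup (sync_from / sync_from_all), while user metadata shows up
  # everywhere in the realm
  radosgw-admin zone modify --rgw-zone=zone-a --sync-from-all=false --sync-from=zone-b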
Thanks,
Yixin
Hello guys,
We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows
clients.
Recently, we had the need to add some VMware clusters as clients of the
iSCSI GW, as well as Windows systems that use Cluster Shared Volumes
(CSV), and we are facing a weird situation. In Windows, for instance, the
iSCSI block device can be mounted, formatted and consumed by all nodes,
but when we add it to the CSV it fails with a generic exception. The same
happens in VMware: when we try to use it with VMFS, it fails.
We have not been able to find the root cause of these errors. However,
they seem to be linked to multiple nodes consuming the same block device
via shared file systems. Have you seen this before?
Are we missing some basic configuration in the iSCSI GW?
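If it helps, we can share output from checks like the following on the
gateway nodes. Just a sketch of what we are looking at:

  # inspect the gateway configuration, exported disks and attached initiators
  gwcli ls

  # watch the ceph-iscsi services while CSV validation or the VMFS format
  # operation fails on the client side
  journalctl -u tcmu-runner -u rbd-target-api -f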
Awesome, thanks for the info!
By any chance, do you happen to know what configurations you needed to
adjust to make Veeam perform a bit better?
On Fri, Jun 23, 2023 at 10:42 AM Anthony D'Atri <aad(a)dreamsnake.net> wrote:
> Yes, with someone I did some consulting for. Veeam seems to be one of the
> prevalent uses for ceph-iscsi, though I'd try to use the native RBD client
> instead if possible.
>
> Veeam appears by default to store really tiny blocks, so there's a lot of
> protocol overhead. I understand that Veeam can be configured to use "large
> blocks" that can make a distinct difference.
>
>
>
> On Jun 23, 2023, at 09:33, Work Ceph <work.ceph.user.mailing(a)gmail.com>
> wrote:
>
> Great question!
>
> Yes, one of the slow cases was detected in a Veeam setup. Have you
> experienced that before?
>
> On Fri, Jun 23, 2023 at 10:32 AM Anthony D'Atri <aad(a)dreamsnake.net>
> wrote:
>
>> Are you using Veeam by chance?
>>
>> > On Jun 22, 2023, at 21:18, Work Ceph <work.ceph.user.mailing(a)gmail.com>
>> wrote:
>> >
>> > Hello guys,
>> >
>> > We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
>> > for some workloads, RadosGW (via S3) for others, and iSCSI for some
>> Windows
>> > clients.
>> >
>> > We started noticing some unexpected performance issues with iSCSI. I
>> > mean, an SSD pool is reaching 100 MB/s of write speed for an image,
>> > when it can reach 600+ MB/s of write speed for the same image when
>> > mounted and consumed directly via RBD.
>> >
>> > Is that performance degradation expected? We would expect some
>> > degradation, but not as much as this.
>> >
>> > Also, we have a question regarding the use of Intel Turbo Boost. Should
>> > we disable it? Is it possible that the root cause of the slowness in the
>> > iSCSI GW is the use of the Intel Turbo Boost feature, which reduces the
>> > clock of some cores?
>> >
>> > Any feedback is much appreciated.
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users(a)ceph.io
>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>
>>
>
Hi,
yesterday I added a new zonegroup, and it seems to cycle over the same
requests over and over again.
In the log of the main zone I see these requests:
2023-06-20T09:48:37.979+0000 7f8941fb3700 1 beast: 0x7f8a602f3700:
fd00:2380:0:24::136 - - [2023-06-20T09:48:37.979941+0000] "GET
/admin/log?type=metadata&id=62&period=e8fc96f1-ae86-4dc1-b432-470b0772fded&max-entries=100&&rgwx-zonegroup=b39392eb-75f8-47f0-b4f3-7d3882930b26
HTTP/1.1" 200 44 - - -
The only thing that changes is the &id parameter.
We have two other zonegroups that are configured identically (ceph.conf
and period), and these don't seem to spam the main RGW.
root@host:~# radosgw-admin sync status
          realm 5d6f2ea4-b84a-459b-bce2-bccac338b3ef (main)
      zonegroup b39392eb-75f8-47f0-b4f3-7d3882930b26 (dc3)
           zone 96f5eca9-425b-4194-a152-86e310e91ddb (dc3)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
root@host:~# radosgw-admin period get
{
    "id": "e8fc96f1-ae86-4dc1-b432-470b0772fded",
    "epoch": 92,
    "predecessor_uuid": "5349ac85-3d6d-4088-993f-7a1d4be3835a",
    "sync_status": [
        "",
        ...
        ""
    ],
    "period_map": {
        "id": "e8fc96f1-ae86-4dc1-b432-470b0772fded",
        "zonegroups": [
            {
                "id": "b39392eb-75f8-47f0-b4f3-7d3882930b26",
                "name": "dc3",
                "api_name": "dc3",
                "is_master": "false",
                "endpoints": [],
                "hostnames": [],
                "hostnames_s3website": [],
                "master_zone": "96f5eca9-425b-4194-a152-86e310e91ddb",
                "zones": [
                    {
                        "id": "96f5eca9-425b-4194-a152-86e310e91ddb",
                        "name": "dc3",
                        "endpoints": [],
                        "log_meta": "false",
                        "log_data": "false",
                        "bucket_index_max_shards": 11,
                        "read_only": "false",
                        "tier_type": "",
                        "sync_from_all": "true",
                        "sync_from": [],
                        "redirect_zone": ""
                    }
                ],
                "placement_targets": [
                    {
                        "name": "default-placement",
                        "tags": [],
                        "storage_classes": [
                            "STANDARD"
                        ]
                    }
                ],
                "default_placement": "default-placement",
                "realm_id": "5d6f2ea4-b84a-459b-bce2-bccac338b3ef",
                "sync_policy": {
                    "groups": []
                }
            },
...
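If useful, I can also provide output from the following. A sketch, run
from the dc3 side, to see whether anything is actually pending for
metadata sync:

  radosgw-admin metadata sync status
  radosgw-admin mdlog status --rgw-zonegroup=dc3
  radosgw-admin sync error list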
--
This time, the "UTF-8 problems" self-help group will exceptionally meet
in the large hall.
Hi, when starting an upgrade from 15.2.17 I got this error:
Module 'cephadm' has failed: Expecting value: line 1 column 1 (char 0)
The cluster was in HEALTH_OK before starting.
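In case it points someone in the right direction, these are the first
things I plan to check. A sketch only; the message looks like a JSON
parse failure somewhere inside the mgr module:

  ceph health detail
  ceph mgr module ls
  ceph config-key dump | grep -i cephadm | head

  # fail over to a standby mgr to see whether the module recovers
  ceph mgr fail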
Hello,
we have a Ceph 17.2.5 cluster with a total of 26 nodes, 15 of which have
faulty NVMe drives where the db/wal resides (one NVMe for the first 6 OSDs
and another for the remaining 6).
We replaced them with new drives and used pvmove to migrate the data, so
as to avoid losing the OSDs.
So far, there are no issues, and the OSDs are functioning properly.
Ceph sees the correct new disks:
root@node02:/# ceph daemon osd.26 list_devices
[
    {
        "device": "/dev/nvme0n1",
        "device_id": "INTEL_SSDPEDME016T4S_CVMD516500851P6KGN"
    },
    {
        "device": "/dev/sdc",
        "device_id": "SEAGATE_ST18000NM004J_ZR52TT830000C148JFSJ"
    }
]
However, the Cephadm GUI still shows the old NVMe drives and hasn't
recognized the device change.
How can we make the GUI and Cephadm recognize the new devices?
I tried restarting the managers, thinking that they would rescan the OSDs
during startup, but it didn't work.
If you have any ideas, I would appreciate them.
Should I perform something like this: ceph orch daemon reconfig osd.*
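Or would something like the following be the right way to force a rescan?
A sketch only, I have not tried the refresh flag yet:

  # ask cephadm to re-run the device inventory on all hosts
  ceph orch device ls --refresh

  # and/or fail over to a standby mgr so the dashboard re-reads the inventory
  ceph mgr fail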
Thank you for your help.