June 2021 - ceph-users - lists.ceph.io

by Szabo, Istvan (Agoda)

Hi,

I'm continuously getting scrub errors in my index pool and log pool, and I always have to repair them.

    HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
    [ERR] OSD_SCRUB_ERRORS: 2 scrub errors
    [ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
        pg 20.19 is active+clean+inconsistent, acting [39,41,37]

Why is this? I have no clue at all: there is no log entry, nothing ☹
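When the same pools keep reporting scrub errors, it can help to record which PGs and which acting OSDs are involved each time, to see whether one OSD recurs. The following is a minimal sketch that extracts that information from `ceph health detail` text in the format quoted above; the function name and the regular expression are illustrative assumptions, not part of any Ceph tooling.

```python
import re

# Matches lines such as:
#   pg 20.19 is active+clean+inconsistent, acting [39,41,37]
PG_LINE = re.compile(
    r"pg (?P<pgid>\d+\.[0-9a-f]+) is (?P<state>[a-z_+]+), "
    r"acting \[(?P<acting>[\d,]+)\]"
)


def inconsistent_pgs(health_detail):
    """Return (pgid, acting OSD ids) for every PG flagged inconsistent."""
    found = []
    for match in PG_LINE.finditer(health_detail):
        if "inconsistent" in match.group("state").split("+"):
            osds = [int(osd) for osd in match.group("acting").split(",")]
            found.append((match.group("pgid"), osds))
    return found


SAMPLE = """HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 20.19 is active+clean+inconsistent, acting [39,41,37]
"""

if __name__ == "__main__":
    for pgid, osds in inconsistent_pgs(SAMPLE):
        print(pgid, osds)
```

For a PG found this way, `rados list-inconsistent-obj <pgid> --format=json-pretty` lists which objects and shards disagree, which may point at the cause before running a repair.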


Upgrade tips from Luminous to Nautilus?

by Mark Schouten

Hi,

We've done our fair share of Ceph cluster upgrades since Hammer, and have not seen many problems with them. I now have to upgrade a rather large cluster running Luminous, and I would like to hear from other users about issues I can expect, so that I can anticipate them beforehand.

As said, the cluster is running Luminous (12.2.13) and has the following services active:

  services:
    mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
    mgr: osdnode01(active), standbys: osdnode02, osdnode03
    mds: pmrb-3/3/3 up {0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active}, 1 up:standby
    osd: 116 osds: 116 up, 116 in;
    rgw: 3 daemons active

Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster is 1.01PiB.

We have 2 active crush rules on 18 pools. All pools have a size of 3, and there is a total of 5760 PGs.

    {
        "rule_id": 1,
        "rule_name": "hdd-data",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -10, "item_name": "default~hdd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "ssd-data",
        "ruleset": 2,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -21, "item_name": "default~ssd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }

    rbd -> crush_rule: hdd-data
    .rgw.root -> crush_rule: hdd-data
    default.rgw.control -> crush_rule: hdd-data
    default.rgw.data.root -> crush_rule: ssd-data
    default.rgw.gc -> crush_rule: ssd-data
    default.rgw.log -> crush_rule: ssd-data
    default.rgw.users.uid -> crush_rule: hdd-data
    default.rgw.usage -> crush_rule: ssd-data
    default.rgw.users.email -> crush_rule: hdd-data
    default.rgw.users.keys -> crush_rule: hdd-data
    default.rgw.meta -> crush_rule: hdd-data
    default.rgw.buckets.index -> crush_rule: ssd-data
    default.rgw.buckets.data -> crush_rule: hdd-data
    default.rgw.users.swift -> crush_rule: hdd-data
    default.rgw.buckets.non-ec -> crush_rule: ssd-data
    DB0475 -> crush_rule: hdd-data
    cephfs_pmrb_data -> crush_rule: hdd-data
    cephfs_pmrb_metadata -> crush_rule: ssd-data

All but four clients are running Luminous; the remaining four are running Jewel (and need upgrading before we proceed with this upgrade).

So, normally, I would 'just' upgrade all Ceph packages on the monitor nodes and restart the mons and then the mgrs. After that, I would upgrade all Ceph packages on the OSD nodes and restart all the OSDs. Then, after that, the MDSes and RGWs. Restarting the OSDs will probably take a while.

If anyone has a hint on what I should expect to cause some extra load or waiting time, that would be great.

Obviously, we have read https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking for real-world experiences.

Thanks!

--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | info(a)tuxis.nl
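The restart order described above (mons, mgrs, OSDs, then MDSes and RGWs) can be tracked during the upgrade with the JSON printed by `ceph versions`. The following is a minimal sketch, assuming that output has been saved and parsed into a dict; the sample data is hypothetical (daemon counts taken from the post, version strings shortened) and the function name is illustrative.

```python
# Order in which the post plans to restart daemon types.
UPGRADE_ORDER = ["mon", "mgr", "osd", "mds", "rgw"]


def daemons_not_on(versions, release):
    """List (daemon type, count) still running something other than `release`.

    `versions` is the parsed JSON of `ceph versions`: a dict that maps each
    daemon type to {version string: number of daemons}.
    """
    pending = []
    for daemon_type in UPGRADE_ORDER:
        counts = versions.get(daemon_type, {})
        behind = sum(n for version, n in counts.items() if release not in version)
        if behind:
            pending.append((daemon_type, behind))
    return pending


# Hypothetical mid-upgrade state: mons and mgrs done, everything else pending.
SAMPLE = {
    "mon": {"ceph version 14.2.x nautilus (stable)": 3},
    "mgr": {"ceph version 14.2.x nautilus (stable)": 3},
    "osd": {"ceph version 12.2.13 luminous (stable)": 116},
    "mds": {"ceph version 12.2.13 luminous (stable)": 4},
    "rgw": {"ceph version 12.2.13 luminous (stable)": 3},
}

if __name__ == "__main__":
    for daemon_type, count in daemons_not_on(SAMPLE, "nautilus"):
        print(f"{daemon_type}: {count} daemon(s) still to restart")
```

Checking this between steps is one way to confirm that a daemon type is fully restarted before moving on to the next.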


Create and listing topics with AWS4 fails

by Daniel Iwan

Hi,

I'm on Pacific 16.2.1.

The documentation states that topics should be created using REST with application/x-www-form-urlencoded. See https://docs.ceph.com/en/latest/radosgw/notifications/#topics

However, when I attempt to create one using Postman (auth v4), the operation fails:

    <?xml version="1.0" encoding="UTF-8"?>
    <Error>
      <Code>NotImplemented</Code>
      <RequestId>tx00000000000000000004e-0060d207d0-df5c88-abcf</RequestId>
      <HostId>df5c88-abc-default</HostId>
    </Error>

The same error occurs when listing topics. See the attached log.

I think similar issues were reported in these threads:
- https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/23BZW2Q3TCU…
- https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/WMKEYKTE5NH…
- https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/2G4YWOUMVE2…

I think the handling procedure is get_auth_data_v4() in https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rest_s3.cc, which does not have RGW_OP_PUBSUB_TOPIC_CREATE or similar topic-related operations. If that's the case, it makes these operations more difficult to integrate, as the majority of software uses AWS V4 authentication.

Can anyone confirm whether the doc is wrong, whether auth V3 needs to be used, etc.? Is there a tracking issue for this already?

Regards
Daniel
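For reference, the request body that the linked documentation describes is a plain form-encoded string. The following is a minimal sketch of building it, assuming the `Action`, `Name` and `Attributes.entry.N.key/value` parameters from that page; the function names are illustrative, and signing and sending the request (the part that fails in the post) is deliberately left out.

```python
from urllib.parse import urlencode


def create_topic_body(name, attributes=None):
    """Form-encoded body for a CreateTopic request (sent via POST with
    Content-Type: application/x-www-form-urlencoded)."""
    fields = [("Action", "CreateTopic"), ("Name", name)]
    # Attributes are numbered from 1, as in the AWS SNS query format.
    for index, (key, value) in enumerate((attributes or {}).items(), start=1):
        fields.append((f"Attributes.entry.{index}.key", key))
        fields.append((f"Attributes.entry.{index}.value", value))
    return urlencode(fields)


def list_topics_body():
    """Form-encoded body for a ListTopics request."""
    return urlencode([("Action", "ListTopics")])


if __name__ == "__main__":
    # "mytopic" and the endpoint URL are placeholder values.
    print(create_topic_body("mytopic", {"push-endpoint": "http://localhost:8080"}))
    print(list_topics_body())
```

Comparing a client's actual body against this shape helps separate an encoding problem from the authentication problem suspected in the post.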


ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool

by Arkadiy Kulev

The pool *default.rgw.buckets.data* has *501 GiB* stored, but USED shows *3.5 TiB* (7 times higher!):

root@ceph-01:~# ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    196 TiB  193 TiB  3.5 TiB  3.6 TiB   1.85
TOTAL  196 TiB  193 TiB  3.5 TiB  3.6 TiB   1.85

--- POOLS ---
POOL                       ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics      1   1    19 KiB   12       56 KiB   0      61 TiB
.rgw.root                  2   32   2.6 KiB  6        1.1 MiB  0      61 TiB
default.rgw.log            3   32   168 KiB  210      13 MiB   0      61 TiB
default.rgw.control        4   32   0 B      8        0 B      0      61 TiB
default.rgw.meta           5   8    4.8 KiB  11       1.9 MiB  0      61 TiB
default.rgw.buckets.index  6   8    1.6 GiB  211      4.7 GiB  0      61 TiB
default.rgw.buckets.data   10  128  501 GiB  5.36M    3.5 TiB  1.90   110 TiB

The *default.rgw.buckets.data* pool is using erasure coding:

root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=4
plugin=jerasure
technique=reed_sol_van
w=8

If anyone could help explain why it's using up 7 times more space, it would help a lot. Versioning is disabled. ceph version 15.2.13 (octopus stable).

Sincerely,
Ark.
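One possible explanation, offered as an assumption rather than a diagnosis, is allocation padding: with many small objects, each of the k+m chunks of an object occupies at least one BlueStore allocation unit. The sketch below works through the arithmetic with the figures from the post and assumes the OSDs were created with a 64 KiB allocation unit on HDD (the Octopus default for `bluestore_min_alloc_size_hdd`). It is a simplified model that treats every object as having the average size and ignores stripe-unit rounding.

```python
import math

KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def ideal_ec_usage(stored_bytes, k, m):
    """Raw usage if erasure coding were the only overhead: stored * (k+m)/k."""
    return stored_bytes * (k + m) / k


def padded_ec_usage(stored_bytes, num_objects, k, m, min_alloc):
    """Raw usage when each of the k+m chunks is rounded up to min_alloc."""
    avg_object = stored_bytes / num_objects
    chunk = math.ceil(avg_object / k)                      # data per chunk
    allocated = math.ceil(chunk / min_alloc) * min_alloc   # space per chunk
    return num_objects * (k + m) * allocated


if __name__ == "__main__":
    stored = 501 * GIB      # STORED column from the post
    objects = 5_360_000     # OBJECTS column from the post (5.36M)
    k, m = 6, 4             # from the erasure-code profile
    print(f"average object: {stored / objects / KIB:.0f} KiB")
    print(f"ideal EC usage: {ideal_ec_usage(stored, k, m) / GIB:.0f} GiB")
    padded = padded_ec_usage(stored, objects, k, m, 64 * KIB)
    print(f"padded usage:   {padded / TIB:.2f} TiB")
```

Under these assumptions the average object is roughly 98 KiB, so each chunk holds about 16 KiB but occupies 64 KiB, and the model lands near the reported 3.5 TiB instead of the roughly 835 GiB that the 10/6 EC ratio alone would give.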


Spurious Read Errors: 0x6706be76

by Jay Sullivan

In the week since upgrading one of our clusters from Nautilus 14.2.21 to Pacific 16.2.4 I've seen four spurious read errors that always have the same bad checksum of 0x6706be76. I've never seen this in any of our clusters before. Here's an example of what I'm seeing in the logs:

ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location [0x18c81b40000~1000], logical extent 0x200000~1000, object #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#

I'm not seeing any indication of inconsistent PGs, only the spurious read error. I don't see an explicit indication of a retry in the logs following the above message. Bluestore code to retry three times was introduced in 2018 following a similar issue with the same checksum: https://tracker.ceph.com/issues/22464

Here's an example of what my health detail looks like:

HEALTH_WARN 1 OSD(s) have spurious read errors
[WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
    osd.117 reads with retries: 1

I followed this (unresolved) thread, too: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DRBVFQLZ5ZY…

I do have swap enabled, but I don't think memory pressure is an issue with 30GB available out of 96GB (and no sign I've been close to summoning the OOM killer). The OSDs that have thrown the cluster into HEALTH_WARN with the spurious read errors are busy 12TB rotational HDDs and I _think_ it's only happening during a deep scrub. We're on Ubuntu 18.04; uname: 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux.

Does Pacific retry three times on a spurious read error? Would I see an indication of a retry in the logs?

Thanks!
~Jay
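To check whether every occurrence really carries the same bad checksum, and on which OSDs, the `_verify_csum` lines can be pulled out of the OSD logs. The following is a minimal sketch based only on the log line format quoted above; the regular expression and function name are illustrative assumptions.

```python
import re

# Matches the BlueStore checksum failure message quoted in the post.
CSUM_LINE = re.compile(
    r"bluestore\((?P<osd_path>[^)]+)\) _verify_csum bad "
    r"(?P<algo>\w+)/(?P<block>0x[0-9a-f]+) checksum at blob offset "
    r"(?P<blob_offset>0x[0-9a-f]+), got (?P<got>0x[0-9a-f]+), "
    r"expected (?P<expected>0x[0-9a-f]+)"
)


def parse_csum_error(line):
    """Return the fields of a _verify_csum error line, or None if no match."""
    match = CSUM_LINE.search(line)
    return match.groupdict() if match else None


SAMPLE = (
    "ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 "
    "bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 "
    "checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, "
    "device location [0x18c81b40000~1000], logical extent 0x200000~1000, "
    "object #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#"
)

if __name__ == "__main__":
    print(parse_csum_error(SAMPLE))
```

Grouping the parsed results by `osd_path` and `got` across all logs would show whether the errors cluster on particular disks or times, for example during deep scrubs.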


Re: XFS on RBD on EC painfully slow

by Sebastian Knust

Hi Reed,

To add to this comment by Weiwen:

On 28.05.21 13:03, 胡玮文 wrote:
> Have you tried just starting multiple rsync processes simultaneously to transfer different directories? Distributed systems like Ceph often benefit from more parallelism.

When I migrated from XFS on iSCSI (legacy system, no Ceph) to CephFS a few months ago, I used msrsync [1] and was quite happy with the speed. For your use case, I would start with -p 12 but might experiment with up to -p 24 (as you only have 6C/12T in your CPU). With many small files, you also might want to increase -s from the default 1000.

Note that msrsync does not work with the --delete rsync flag. As I was syncing a live system, I ended up with this workflow:

- Initial sync with msrsync (something like ./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" ...)
- Second sync with msrsync (to sync changes made during the first sync)
- Take old storage off-line for users / read-only
- Final rsync with --delete (i.e. rsync -aS --numeric-ids --delete ...)
- Mount cephfs at location of old storage, adjust /etc/exports with fsid entries where necessary, turn system back on-line / read-write

Cheers
Sebastian

[1] https://github.com/jbd/msrsync
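The sync steps of the workflow above can be sketched as a small dry-run helper that only builds and prints the commands. The source and destination paths are hypothetical placeholders (the post elides them), the function name is illustrative, and the msrsync and rsync options are the ones quoted in the post.

```python
import shlex

# rsync options quoted in the post.
RSYNC_OPTS = "-aS --numeric-ids"


def migration_commands(src, dst, processes=12):
    """Build the three sync passes of the workflow as argument lists."""
    # msrsync takes the rsync options as one quoted string via --rsync.
    parallel = ["./msrsync", "-p", str(processes), "--progress", "--stats",
                "--rsync", RSYNC_OPTS, src, dst]
    # The final pass uses plain rsync because msrsync does not support
    # --delete; run it only after the old storage is read-only.
    final = ["rsync", *RSYNC_OPTS.split(), "--delete", src, dst]
    return [
        ("initial sync", parallel),
        ("second sync", parallel),
        ("final sync", final),
    ]


if __name__ == "__main__":
    # Placeholder paths; nothing is executed, the commands are only printed.
    for label, command in migration_commands("/mnt/old/", "/mnt/cephfs/"):
        print(f"{label}: {shlex.join(command)}")
```

Printing instead of executing keeps the sketch safe to run; the read-only switch and the remount remain manual steps between the second and final pass.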


Pacific: RadosGW crashing on multipart uploads.

by Chu, Vincent

Hi,

I'm running into an issue with RadosGW where multipart uploads crash, but only on buckets that have a hyphen, period or underscore in the bucket name and a bucket policy applied. We've tested this in Pacific 16.2.3 and Pacific 16.2.4. Has anyone run into this before?

ubuntu@ubuntu:~/ubuntu$ aws --endpoint http://placeholder.com:7480 s3 cp ubuntu.iso s3://bucket.test
upload failed: ./ubuntu.iso to s3://bucket.test/ubuntu.iso Connection was closed before we received a valid response from endpoint URL: "http://placeholder.com:7480/bucket.test/ubuntu.iso?uploads".

Here is the crash log.

   -12> 2021-06-29T20:44:10.940+0000 7fae1f4ec700  1 ====== starting new request req=0x7fadf8998620 =====
   -11> 2021-06-29T20:44:10.940+0000 7fae1f4ec700  2 req 2403 0.000000000s initializing for trans_id = tx000000000000000000963-0060db861a-17e77ee-default
   -10> 2021-06-29T20:44:10.940+0000 7fae1f4ec700  2 req 2403 0.000000000s getting op 4
    -9> 2021-06-29T20:44:10.940+0000 7fae1f4ec700  2 req 2403 0.000000000s s3:init_multipart verifying requester
    -8> 2021-06-29T20:44:10.948+0000 7fae1f4ec700  2 req 2403 0.008000608s s3:init_multipart normalizing buckets and tenants
    -7> 2021-06-29T20:44:10.948+0000 7fae1f4ec700  2 req 2403 0.008000608s s3:init_multipart init permissions
    -6> 2021-06-29T20:44:10.954+0000 7faedf66c700  0 Supplied principal is discarded: arn:aws:iam::default:user
    -5> 2021-06-29T20:44:10.954+0000 7faedf66c700  2 req 2403 0.014001064s s3:init_multipart recalculating target
    -4> 2021-06-29T20:44:10.954+0000 7faedf66c700  2 req 2403 0.014001064s s3:init_multipart reading permissions
    -3> 2021-06-29T20:44:10.954+0000 7faedf66c700  2 req 2403 0.014001064s s3:init_multipart init op
    -2> 2021-06-29T20:44:10.954+0000 7faedf66c700  2 req 2403 0.014001064s s3:init_multipart verifying op mask
    -1> 2021-06-29T20:44:10.955+0000 7faedf66c700  2 req 2403 0.015001140s s3:init_multipart verifying op permissions
     0> 2021-06-29T20:44:10.964+0000 7faedf66c700 -1 *** Caught signal (Segmentation fault) **
 in thread 7faedf66c700 thread_name:radosgw

 ceph version 16.2.3 (381b476cb3900f9a92eb95d03b4850b953cfd79a) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7faf2dd05b20]
 2: (rgw_bucket::rgw_bucket(rgw_bucket const&)+0x23) [0x7faf38b4d083]
 3: (rgw::sal::RGWObject::get_obj() const+0x20) [0x7faf38b7bcf0]
 4: (RGWInitMultipart::verify_permission(optional_yield)+0x6c) [0x7faf38e6608c]
 5: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0x86a) [0x7faf38b2db1a]
 6: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, int*)+0x26dd) [0x7faf38b3232d]
 7: /lib64/libradosgw.so.2(+0x4a1c0b) [0x7faf38a83c0b]
 8: /lib64/libradosgw.so.2(+0x4a36a4) [0x7faf38a856a4]
 9: /lib64/libradosgw.so.2(+0x4a390e) [0x7faf38a8590e]
 10: make_fcontext()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--
Vincent Chu
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory


Having issues to start more than 24 OSDs per host

by Jan.Jansen＠gdata.de

Hello,

We tried to use Cephadm with Podman to start 44 OSDs per host, but the deployment consistently stops after adding 24 OSDs per host.

We looked into cephadm.log on the problematic host and saw that the command `cephadm ceph-volume lvm list --format json` got stuck. We saw that the output of the command wasn't complete. Therefore, we tried to use compacted JSON, and with that we could increase the number to 36 OSDs per host.

If you need more information, just ask.

Podman version: 3.2.1
Ceph version: 16.2.4
OS version: Suse Leap 15.3

Greetings,
Jan


cephfs forward scrubbing docs

by Dan van der Ster

Hi,

Today while debugging something we had a few questions that might lead to improving the cephfs forward scrub docs: https://docs.ceph.com/en/latest/cephfs/scrub/

tldr:
1. Should we document which sorts of issues the forward scrub is able to fix?
2. Can we make it more visible (in docs) that scrubbing is not supported with multi-mds?
3. Isn't the new `ceph -s` scrub task status misleading with multi-mds?

Details here:

1) We found a CephFS directory with a number of zero-sized files:

# ls -l
...
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:58 upload_fc501199e3e7abe6b574101cf34aeefb.png
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 12:23 upload_fce4f55348185fefa0abdd8d11095ba8.gif
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:54 upload_fd95b8358851f0dac22fb775046a6163.png
...

The user claims that those files were non-zero sized last week. The sequence of zero-sized files includes *all* files written between Nov 2 and 9. The user claims that his client was running out of memory, but this is now fixed. So I suspect that his ceph client (kernel 3.10.0-1127.19.1.el7.x86_64) was not behaving well.

Anyway, I noticed that even though the dentries list 0 bytes, the underlying rados objects have data, and the data looks good. E.g.

# rados get -p cephfs_data 200212e68b5.00000000 --namespace=xxx 200212e68b5.00000000
# file 200212e68b5.00000000
200212e68b5.00000000: PNG image data, 960 x 815, 8-bit/color RGBA, non-interlaced

So I managed to recover the files doing something like this (using an input file mapping inode to filename) [see PS 0].

But I'm wondering if a forward scrub is able to fix this sort of problem directly? Should we document which sorts of issues the forward scrub is able to fix?

I tried to scrub it anyway, which led to:

# ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx recursive repair
Scrub is not currently supported for multiple active MDS. Please reduce max_mds to 1 and then scrub.

So ...

2) Shouldn't we update the doc to mention loud and clear that scrub is not currently supported for multiple active MDS?

3) I was somewhat surprised by this, because I had thought that the new `ceph -s` multi-mds scrub status implied that multi-mds scrubbing was now working:

  task status:
    scrub status:
        mds.x: idle
        mds.y: idle
        mds.z: idle

Is it worth reporting this task status for cephfs if we can't even scrub them?

Thanks!!

Dan

[0]
mkdir -p recovered
while read -r a b; do
  for i in {0..9}
  do
    echo "rados stat --cluster=flax --pool=cephfs_data --namespace=xxx" $(printf "%x" $a).0000000$i "&&" "rados get --cluster=flax --pool=cephfs_data --namespace=xxx" $(printf "%x" $a).0000000$i $(printf "%x" $a).0000000$i
  done
  echo cat $(printf "%x" $a).* ">" $(printf "%x" $a)
  echo mv $(printf "%x" $a) recovered/$b
done < inones_fnames.txt
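The recovery script in [0] relies on how CephFS names a file's data objects: the inode number in hexadecimal, a dot, then the object index as eight hexadecimal digits. The following is a minimal sketch of that naming, assuming (as the `printf "%x"` in the script suggests) that the input file lists inode numbers in decimal; the function name is illustrative.

```python
def data_object_names(inode, count=10):
    """Names of the first `count` RADOS data objects of a CephFS file.

    Each name is "<inode in hex>.<object index as 8 hex digits>".
    """
    return [f"{inode:x}.{index:08x}" for index in range(count)]


if __name__ == "__main__":
    # Inode of the PNG file shown in the post, written here as a hex literal.
    for name in data_object_names(0x200212E68B5, count=3):
        print(name)
```

The shell loop covers only the first ten objects of each file (indexes 0 to 9), which is enough for small uploads; larger files would need a higher count.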


Semantics of cephfs-mirror

by Manuel Holtgrewe

Dear all,

I'm sorry if I'm asking the obvious or missing a previous discussion of this, but I could not find the answer to my question online. I'd be happy just to be pointed in the right direction.

The cephfs-mirror tool in Pacific looks extremely promising. How does it work exactly? Is it based on files and (recursive) ctime, or rather on object information? Does it handle (only) incremental changes between snapshots?

There is an issue related to this that mentions recursive ctime. But that would mean that users could "rsync -a" data to the file system and this would not get synchronized.

I have good experience with ZFS, which is able to identify changes between two snapshots A and B and then transfer only these changes (at a sub-file level, on the ZFS equivalent of blocks, to my understanding) to another server with the same file system that is in the exact state of snapshot A. Does cephfs-mirror work the same way?

Best wishes,
Manuel

