April 2023 - ceph-users - lists.ceph.io

cephadm upgrade 16.2.10 to 16.2.11: osds crash and get stuck restarting

by Zakhar Kirpichenko

Hi, Attempted to upgrade 16.2.10 to 16.2.11, 2 OSDs out of many started crashing in a loop on the very 1st host: Jan 25 23:07:53 ceph01 bash[2553123]: Uptime(secs): 0.0 total, 0.0 interval Jan 25 23:07:53 ceph01 bash[2553123]: Flush(GB): cumulative 0.000, interval 0.000 Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(GB): cumulative 0.000, interval 0.000 Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(Total Files): cumulative 0, interval 0 Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(L0 Files): cumulative 0, interval 0 Jan 25 23:07:53 ceph01 bash[2553123]: AddFile(Keys): cumulative 0, interval 0 Jan 25 23:07:53 ceph01 bash[2553123]: Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds Jan 25 23:07:53 ceph01 bash[2553123]: Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds Jan 25 23:07:53 ceph01 bash[2553123]: Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count Jan 25 23:07:53 ceph01 bash[2553123]: ** File Read Latency Histogram By Level [P] ** Jan 25 23:07:53 ceph01 bash[2553123]: debug -10> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986439) [db_impl/db_impl_compaction_flush.cc:2611] Com paction nothing to do Jan 25 23:07:53 ceph01 bash[2553123]: debug -9> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986493) [db_impl/db_impl_compaction_flush.cc:2611] Com paction nothing to do Jan 25 23:07:53 ceph01 bash[2553123]: debug -8> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986500) [db_impl/db_impl_compaction_flush.cc:2611] Com paction nothing to do Jan 25 23:07:53 ceph01 bash[2553123]: debug -7> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: (Original Log Time 2023/01/25-23:07:52.986505) [db_impl/db_impl_compaction_flush.cc:2611] Com paction nothing to do Jan 25 23:07:53 ceph01 bash[2553123]: debug -6> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: [compaction/compaction_job.cc:1676] [O-2] [JOB 9] Compacting 4@0 + 2@1 files to L1, score 1.0 0 Jan 25 23:07:53 ceph01 bash[2553123]: debug -5> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: [compaction/compaction_job.cc:1680] [O-2] Compaction start summary: Base version 33 Base level 0, inputs: [649058(959KB) 649046(1510KB) 649024(1323KB) 649002(1396KB)], [648981(66MB) 648982(52MB)] Jan 25 23:07:53 ceph01 bash[2553123]: debug -4> 2023-01-25T23:07:52.982+0000 7f7c87e67700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1674688072986547, "job": 9, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [649058, 649046, 649024, 649002], "files_L1": [648981, 648982], "score": 1, "input_data_size": 129161327} Jan 25 23:07:53 ceph01 bash[2553123]: debug -3> 2023-01-25T23:07:52.990+0000 7f7c8b66e700 -1 bdev(0x5619bf0ce400 /var/lib/ceph/osd/ceph-3/block) _aio_thread got r=-1 ((1) Operation not permitted) Jan 25 23:07:53 ceph01 bash[2553123]: debug -2> 2023-01-25T23:07:52.990+0000 7f7c8b66e700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11/rpm/el8/BUILD/ceph-16.2.11/src/blk/kernel/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f7c8b66e700 time 2023-01-25T23:07:52.993976+0000 Jan 25 23:07:53 ceph01 bash[2553123]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.11/rpm/el8/BUILD/ceph-16.2.11/src/blk/kernel/KernelDevice.cc: 604: ceph_abort_msg("Unexpected IO error. This may suggest HW issue. Please check your dmesg!") Jan 25 23:07:53 ceph01 bash[2553123]: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable) Jan 25 23:07:53 ceph01 bash[2553123]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x5619b2fc1adc] Jan 25 23:07:53 ceph01 bash[2553123]: 2: (KernelDevice::_aio_thread()+0x1285) [0x5619b3b2a4e5] Jan 25 23:07:53 ceph01 bash[2553123]: 3: (KernelDevice::AioCompletionThread::entry()+0x11) [0x5619b3b357b1] Jan 25 23:07:53 ceph01 bash[2553123]: 4: /lib64/libpthread.so.0(+0x81ca) [0x7f7c978851ca] Jan 25 23:07:53 ceph01 bash[2553123]: 5: clone() Jan 25 23:07:53 ceph01 bash[2553123]: debug -1> 2023-01-25T23:07:52.994+0000 7f7c8b66e700 -1 *** Caught signal (Aborted) ** Jan 25 23:07:53 ceph01 bash[2553123]: in thread 7f7c8b66e700 thread_name:bstore_aio Jan 25 23:07:53 ceph01 bash[2553123]: ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable) Jan 25 23:07:53 ceph01 bash[2553123]: 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f7c9788fcf0] Jan 25 23:07:53 ceph01 bash[2553123]: 2: gsignal() Jan 25 23:07:53 ceph01 bash[2553123]: 3: abort() Jan 25 23:07:53 ceph01 bash[2553123]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x5619b2fc1bad] Jan 25 23:07:53 ceph01 bash[2553123]: 5: (KernelDevice::_aio_thread()+0x1285) [0x5619b3b2a4e5] Jan 25 23:07:53 ceph01 bash[2553123]: 6: (KernelDevice::AioCompletionThread::entry()+0x11) [0x5619b3b357b1] Jan 25 23:07:53 ceph01 bash[2553123]: 7: /lib64/libpthread.so.0(+0x81ca) [0x7f7c978851ca] OSD kept crashing until the host reboot, an OSD restart wouldn't help. This hasn't happened during any previous upgrades, so was a rather unexpected development. Unclear what caused this, but a host reboot seems to have fixed it. It happened to 1 other OSD on another host, exactly the same symptoms, also solved by a reboot. Best regards, Zakhar

9 months, 4 weeks

4
4
0 0

upgrading from 15.2.17 to 16.2.11 - Health ERROR

by xadhoom76＠gmail.com

hi , starting upgrade from 15.2.17 i got this error Module 'cephadm' has failed: Expecting value: line 1 column 1 (char 0) Cluster was in health ok before starting.

10 months, 1 week

5
9
0 0

Help needed to configure erasure coding LRC plugin

by Michel Jouvin

Hi, As discussed in another thread (Crushmap rule for multi-datacenter erasure coding), I'm trying to create an EC pool spanning 3 datacenters (datacenters are present in the crushmap), with the objective to be resilient to 1 DC down, at least keeping the readonly access to the pool and if possible the read-write access, and have a storage efficiency better than 3 replica (let say a storage overhead <= 2). In the discussion, somebody mentioned LRC plugin as a possible jerasure alternative to implement this without tweaking the crushmap rule to implement the 2-step OSD allocation. I looked at the documentation (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but I have some questions if someone has experience/expertise with this LRC plugin. I tried to create a rule for using 5 OSDs per datacenter (15 in total), with 3 (9 in total) being data chunks and others being coding chunks. For this, based of my understanding of examples, I used k=9, m=3, l=4. Is it right? Is this configuration equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6? The resulting rule, which looks correct to me, is: -------- { "rule_id": 6, "rule_name": "test_lrc_2", "ruleset": 6, "type": 3, "min_size": 3, "max_size": 15, "steps": [ { "op": "set_chooseleaf_tries", "num": 5 }, { "op": "set_choose_tries", "num": 100 }, { "op": "take", "item": -4, "item_name": "default~hdd" }, { "op": "choose_indep", "num": 3, "type": "datacenter" }, { "op": "chooseleaf_indep", "num": 5, "type": "host" }, { "op": "emit" } ] } ------------ Unfortunately, it doesn't work as expected: a pool created with this rule ends up with its pages active+undersize, which is unexpected for me. Looking at 'ceph health detail` output, I see for each page something like: pg 52.14 is stuck undersized for 27m, current state active+undersized, last acting [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647] For each PG, there is 3 '2147483647' entries and I guess it is the reason of the problem. What are these entries about? Clearly it is not OSD entries... Looks like a negative number, -1, which in terms of crushmap ID is the crushmap root (named "default" in our configuration). Any trivial mistake I would have made? Thanks in advance for any help or for sharing any successful configuration? Best regards, Michel

10 months, 1 week

4
20
0 0

RadosGW S3 API Multi-Tenancy

by Brad House

I'm in the process of exploring if it is worthwhile to add RadosGW to our existing ceph cluster. We've had a few internal requests for exposing the S3 API for some of our business units, right now we just use the ceph cluster for VM disk image storage via RBD. Everything looks pretty straight forward until we hit multitenancy. The page on multi-tenancy doesn't dive into permission delegation: https://docs.ceph.com/en/quincy/radosgw/multitenancy/ The end goal I want is to be able to create a single user per tenant (Business Unit) which will act as their 'administrator', where they can then do basically whatever they want under their tenant sandbox (though I don't think we need more advanced cases like creations of roles or policies, just create/delete their own users, buckets, objects). I was hopeful this would just work, and I asked on the ceph IRC channel on OFTC and was told once I grant a user caps="users=*", they would then be allowed to create users *outside* of their own tenant using the Rados Admin API and that I should explore IAM roles. I think it would make sense to add a feature, such as a flag that can be set on a user, to ensure they stay in their "sandbox". I'd assume this is probably a common use-case. Anyhow, if its possible to do today using iam roles/policies, then great, unfortunately this is my first time looking at this stuff and there are some things not immediately obvious. I saw this online about AWS itself and creating a permissions boundary, but that's for allowing creation of roles within a boundary: https://www.qloudx.com/delegate-aws-iam-user-and-role-creation-without-givi… I'm not sure what "Action" is associated with the Rados Admin API create user for applying a boundary that the user can only create users with the same tenant name. https://docs.ceph.com/en/quincy/radosgw/adminops/#create-user Any guidance on this would be extremely helpful. Thanks! -Brad

11 months

1
1
0 0

Small RGW objects and RADOS 64KB minimun size

by Loïc Dachary

Bonjour, Reading Karan's blog post about benchmarking the insertion of billions objects to Ceph via S3 / RGW[0] from last year, it reads: > we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart-5, the object creation rate found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than the bluestore_min_alloc_size_hdd , the default values seems to be optimal, smaller objects further require more investigation if you intended to reduce bluestore_min_alloc_size_hdd parameter. There also is a mail thread dated 2018 on this topic as well, with the same conclusion although using RADOS directly and not RGW[3]. I read the RGW data layout page in the documentation[1] and concluded that by default every object inserted with S3 / RGW will indeed use at least 64kb. A pull request from last year[2] seems to confirm it and also suggests modifying bluestore_min_alloc_size_hdd has adverse side effects. That being said, I'm curious to know if people developed strategies to cope with this overhead. Someone mentioned packing objects together client side to make them larger. But maybe there are simpler ways to do the same? Cheers [0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond [1] https://docs.ceph.com/en/latest/radosgw/layout/ [2] https://github.com/ceph/ceph/pull/32809 [3] https://www.spinics.net/lists/ceph-users/msg45755.html -- Loïc Dachary, Artisan Logiciel Libre

11 months, 1 week

4
6
0 0

ln: failed to create hard link 'file name': Read-only file system

by Frank Schilder

Hi all, on an NFS re-export of a ceph-fs (kernel client) I observe a very strange error. I'm un-taring a larger package (1.2G) and after some time I get these errors: ln: failed to create hard link 'file name': Read-only file system The strange thing is that this seems only temporary. When I used "ln src dst" for manual testing, the command failed as above. However, after that I tried "ln -v src dst" and this command created the hard link with exactly the same path arguments. During the period when the error occurs, I can't see any FS in read-only mode, neither on the NFS client nor the NFS server. Funny thing is that file creation and write still works, its only the hard-link creation that fails. For details, the set-up is: file-server: mount ceph-fs at /shares/path, export /shares/path as nfs4 to other server other server: mount /shares/path as NFS More precisely, on the file-server: fstab: MON-IPs:/shares/folder /shares/nfs/folder ceph defaults,noshare,name=NAME,secretfile=sec.file,mds_namespace=FS-NAME,_netdev 0 0 exports: /shares/nfs/folder -no_root_squash,rw,async,mountpoint,no_subtree_check DEST-IP On the host at DEST-IP: fstab: FILE-SERVER-IP:/shares/nfs/folder /mnt/folder nfs defaults,_netdev 0 0 Both, the file server and the client server are virtual machines. The file server is on Centos 8 stream (4.18.0-338.el8.x86_64) and the client machine is on AlmaLinux 8 (4.18.0-425.13.1.el8_7.x86_64). When I change the NFS export from "async" to "sync" everything works. However, that's a rather bad workaround and not a solution. Although this looks like an NFS issue, I'm afraid it is a problem with hard links and ceph-fs. It looks like a race with scheduling and executing operations on the ceph-fs kernel mount. Has anyone seen something like that? Thanks and best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14

11 months, 2 weeks

4
13
0 0

Re: MDS crashes to damaged metadata

by Patrick Donnelly

On Thu, Dec 15, 2022 at 9:32 AM Stolte, Felix <f.stolte(a)fz-juelich.de> wrote: > > Hi Patrick, > > we used your script to repair the damaged objects on the weekend and it went smoothly. Thanks for your support. > > We adjusted your script to scan for damaged files on a daily basis, runtime is about 6h. Until thursday last week, we had exactly the same 17 Files. On thursday at 13:05 a snapshot was created and our active mds crashed once at this time (snapshot was created): > > 2022-12-08T13:05:48.919+0100 7f440afec700 -1 /build/ceph-16.2.10/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f440afec700 time 2022-12-08T13:05:48.921223+0100 > /build/ceph-16.2.10/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state LOCK_XLOCK || state LOCK_XLOCKDONE) > > 12 Minutes lates the unlink_local error crashes appeared again. This time with a new file. During debugging we noticed a MTU mismatch between MDS (1500) and client (9000) with cephfs kernel mount. The client is also creating the snapshots via mkdir in the .snap directory. > > We disabled snapshot creation for now, but really need this feature. I uploaded the mds logs of the first crash along with the information above to https://tracker.ceph.com/issues/38452 > > I would greatly appreciate it, if you could answer me the following question: > > Is the Bug related to our MTU Mismatch? We fixed the MTU Issue going back to 1500 on all nodes in the ceph public network on the weekend also. I doubt it. > If you need a debug level 20 log of the ScatterLock for further analysis, i could schedule snapshots at the end of our workdays and increase the debug level 5 Minutes arround snap shot creation. This would be very helpful! -- Patrick Donnelly, Ph.D. He / Him / His Principal Software Engineer Red Hat, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

11 months, 2 weeks

2
2
0 0

Radosgw multisite replication issues

by Tarrago, Eli (RIS-BCT)

Good Afternoon, I am experiencing an issue where east-1 is no longer able to replicate from west-1, however, after a realm pull, west-1 is now able to replicate from east-1. In other words: West <- Can Replicate <- East West -> Cannot Replicate -> East After confirming the access and secret keys are identical on both sides, I restarted all radosgw services. Here is the current status of the cluster below. Thank you for your help, Eli Tarrago root@east01:~# radosgw-admin zone get { "id": "ddd66ab8-0417-46ee-a53b-043352a63f93", "name": "rgw-east", "domain_root": "rgw-east.rgw.meta:root", "control_pool": "rgw-east.rgw.control", "gc_pool": "rgw-east.rgw.log:gc", "lc_pool": "rgw-east.rgw.log:lc", "log_pool": "rgw-east.rgw.log", "intent_log_pool": "rgw-east.rgw.log:intent", "usage_log_pool": "rgw-east.rgw.log:usage", "roles_pool": "rgw-east.rgw.meta:roles", "reshard_pool": "rgw-east.rgw.log:reshard", "user_keys_pool": "rgw-east.rgw.meta:users.keys", "user_email_pool": "rgw-east.rgw.meta:users.email", "user_swift_pool": "rgw-east.rgw.meta:users.swift", "user_uid_pool": "rgw-east.rgw.meta:users.uid", "otp_pool": "rgw-east.rgw.otp", "system_key": { "access_key": "PxxxxxxxxxxxxxxxxW", "secret_key": "Hxxxxxxxxxxxxxxxx6" }, "placement_pools": [ { "key": "default-placement", "val": { "index_pool": "rgw-east.rgw.buckets.index", "storage_classes": { "STANDARD": { "data_pool": "rgw-east.rgw.buckets.data" } }, "data_extra_pool": "rgw-east.rgw.buckets.non-ec", "index_type": 0 } } ], "realm_id": "98e0e391-16fb-48da-80a5-08437fd81789", "notif_pool": "rgw-east.rgw.log:notif" } root@west01:~# radosgw-admin zone get { "id": "b2a4a31c-1505-4fdc-b2e0-ea07d9463da1", "name": "rgw-west", "domain_root": "rgw-west.rgw.meta:root", "control_pool": "rgw-west.rgw.control", "gc_pool": "rgw-west.rgw.log:gc", "lc_pool": "rgw-west.rgw.log:lc", "log_pool": "rgw-west.rgw.log", "intent_log_pool": "rgw-west.rgw.log:intent", "usage_log_pool": "rgw-west.rgw.log:usage", "roles_pool": "rgw-west.rgw.meta:roles", "reshard_pool": "rgw-west.rgw.log:reshard", "user_keys_pool": "rgw-west.rgw.meta:users.keys", "user_email_pool": "rgw-west.rgw.meta:users.email", "user_swift_pool": "rgw-west.rgw.meta:users.swift", "user_uid_pool": "rgw-west.rgw.meta:users.uid", "otp_pool": "rgw-west.rgw.otp", "system_key": { "access_key": "PxxxxxxxxxxxxxxW", "secret_key": "Hxxxxxxxxxxxxxx6" }, "placement_pools": [ { "key": "default-placement", "val": { "index_pool": "rgw-west.rgw.buckets.index", "storage_classes": { "STANDARD": { "data_pool": "rgw-west.rgw.buckets.data" } }, "data_extra_pool": "rgw-west.rgw.buckets.non-ec", "index_type": 0 } } ], "realm_id": "98e0e391-16fb-48da-80a5-08437fd81789", "notif_pool": "rgw-west.rgw.log:notif" east01:~# radosgw-admin metadata sync status { "sync_status": { "info": { "status": "init", "num_shards": 0, "period": "", "realm_epoch": 0 }, "markers": [] }, "full_sync": { "total": 0, "complete": 0 } } west01:~# radosgw-admin metadata sync status { "sync_status": { "info": { "status": "sync", "num_shards": 64, "period": "44b6b308-e2d8-4835-8518-c90447e7b55c", "realm_epoch": 3 }, "markers": [ { "key": 0, "val": { "state": 1, "marker": "", "next_step_marker": "", "total_entries": 46, "pos": 0, "timestamp": "0.000000", "realm_epoch": 3 } }, #### goes on for a long time… { "key": 63, "val": { "state": 1, "marker": "", "next_step_marker": "", "total_entries": 0, "pos": 0, "timestamp": "0.000000", "realm_epoch": 3 } } ] }, "full_sync": { "total": 46, "complete": 46 } } east01:~# radosgw-admin sync status realm 98e0e391-16fb-48da-80a5-08437fd81789 (rgw-blobs) zonegroup 0e0faf4e-39f5-402e-9dbb-4a1cdc249ddd (EastWestceph) zone ddd66ab8-0417-46ee-a53b-043352a63f93 (rgw-east) metadata sync no sync (zone is master) 2023-04-20T19:03:13.388+0000 7f25fa036c80 0 ERROR: failed to fetch datalog info data sync source: b2a4a31c-1505-4fdc-b2e0-ea07d9463da1 (rgw-west) failed to retrieve sync info: (13) Permission denied west01:~# radosgw-admin sync status realm 98e0e391-16fb-48da-80a5-08437fd81789 (rgw-blobs) zonegroup 0e0faf4e-39f5-402e-9dbb-4a1cdc249ddd (EastWestceph) zone b2a4a31c-1505-4fdc-b2e0-ea07d9463da1 (rgw-west) metadata sync syncing full sync: 0/64 shards incremental sync: 64/64 shards metadata is caught up with master data sync source: ddd66ab8-0417-46ee-a53b-043352a63f93 (rgw-east) syncing full sync: 0/128 shards incremental sync: 128/128 shards data is behind on 16 shards behind shards: [5,56,62,65,66,70,76,86,87,94,104,107,111,113,120,126] oldest incremental change not applied: 2023-04-20T19:02:48.783283+0000 [5] east01:~# radosgw-admin zonegroup get { "id": "0e0faf4e-39f5-402e-9dbb-4a1cdc249ddd", "name": "EastWestceph", "api_name": "EastWestceph", "is_master": "true", "endpoints": [ http://east01.example.net:8080, http://east02.example.net:8080, http://east03.example.net:8080, http://west01.example.net:8080, http://west02.example.net:8080, http://west03.example.net:8080 ], "hostnames": [ "eastvip.example.net", "westvip.example.net" ], "hostnames_s3website": [], "master_zone": "ddd66ab8-0417-46ee-a53b-043352a63f93", "zones": [ { "id": "b2a4a31c-1505-4fdc-b2e0-ea07d9463da1", "name": "rgw-west", "endpoints": [ http://west01.example.net:8080, http://west02.example.net:8080, http://west03.example.net:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "ddd66ab8-0417-46ee-a53b-043352a63f93", "name": "rgw-east", "endpoints": [ http://east01.example.net:8080, http://east02.example.net:8080, http://east03.example.net:8080 ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" } ], "placement_targets": [ { "name": "default-placement", "tags": [], "storage_classes": [ "STANDARD" ] } ], "default_placement": "default-placement", "realm_id": "98e0e391-16fb-48da-80a5-08437fd81789", "sync_policy": { "groups": [] } } ________________________________ The information contained in this e-mail message is intended only for the personal and confidential use of the recipient(s) named above. This message may be an attorney-client communication and/or work product and as such is privileged and confidential. If the reader of this message is not the intended recipient or an agent responsible for delivering it to the intended recipient, you are hereby notified that you have received this document in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail, and delete the original message.

12 months

4
6
0 0

Lua scripting in the rados gateway

by Thomas Bennett

Hi ceph users, I've been trying out the lua scripting for the rados gateway (thanks Yuval). As in my previous email I mentioned that there is an error when trying to load the luasocket module. However, I thought it was a good time to report on my progress. My 'hello world' example below is called *test.lua* below includes the following checks: 1. Can I write to the debug log? 2. Can I use the lua socket package to do something stupid but intersting, like connect to a webservice? Before you continue reading this, you might need to know that I run all ceph processes in a *CentOS Stream release 8 *container deployed using ceph orchestrator running *Ceph v17.2.5*, so please view the information below in that context. For anyone looking for a reference, I suggest going to the ceph lua rados gateway documentation at radosgw/lua-scripting <https://docs.ceph.com/en/quincy/radosgw/lua-scripting/>. There are two new switches you need to know about in the radosgw-admin: - *script* -> loading your lua script - *script-package* -> loading supporting packages for your script - e.i. luasocket in this case. For a basic setup, you'll need to have a few dependencies in your containers: - cephadm container: requires luarocks (I've checked the code - it runs a luarocks search command) - radosgw container: requires luarocks, gcc, make, m4, wget (wget just in case). To achieve the above, I updated the container image for our running system. I needed to do this because I needed to redeploy the rados gateway container to inject the lua script packages into the radosgw runtime process. This will start with a fresh container based on the global config *container_image* setting on your running system. For us this is currently captured in *quay.io/tsolo/ceph:v17.2.5-3 <http://quay.io/tsolo/ceph:v17.2.5-3>* and included the following exta steps (including installing the lua dev from an rpm because there is no centos package in yum): yum install luarocks gcc make wget m4 rpm -i https://rpmfind.net/linux/centos/8-stream/PowerTools/x86_64/os/Packages/lua… You will notice that I've included a compiler and compiler support into the image. This is because luarocks on the radosgw to compile luasocket (the package I want to install). This will happen at start time when the radosgw is restarted from ceph orch. In the cephadm container I still need to update our cephadm shell so I need to install luarocks by hand: yum install luarocks Then set thew updated image to use: ceph config set global container_image quay.io/tsolo/ceph:v17.2.5-3 I now create a file called: *test.lua* in the cephadm container. This contains the following lines to write to the log and then do a get request to google. This is not practical in production, but it serves the purpose of testing the infrastructure: RGWDebugLog("Tsolo start lua script") local LuaSocket = require("socket") client = LuaSocket.connect("google.com", 80) client:send("GET / HTTP/1.0\r\nHost: google.com\r\n\r\n") while true do s, status, partial = client:receive('*a') RGWDebugLog(s or partial) if status == "closed" then break end end client:close() RGWDebugLog("Tsolo stop lua") Next I run: radosgw-admin script-package add --package=luasocket --allow-compilation And then list the added package to make sure it is there: radosgw-admin script-package list Note - at this point the radosgw has not been modified, it must first be restarted. Then I put the *test.lua *script into the pre request context: radosgw-admin script put --infile=test.lua --context=preRequest You also need to raise the debug log level on the running rados gateway: ceph daemon /var/run/ceph/ceph-client.rgw.xxx.xxx-cms1.xxxxx.x.xxxxxxxxxxxxxx.asok config set debug_rgw 20 Inside the radosgw container I apply my fix (as per previous email): cp -ru /tmp/luarocks/client.rgw.xxxxxx.xxx-xxxx-xxxx.pcoulb/lib64/* /tmp/luarocks/client.rgw.xxxxxx.xxx-xxxx-xxxx.pcoulb/lib/ Outside on the host running the radosgw-admin container I follow the journalctl for the radosgw container (to get the logs): journalctl -fu ceph-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx@rgw. xxx.xxx-cms1.xxxxx.x.xxxxxxxxxxxxxx.service Then I run an s3cmd to put data in via the rados gateway and check the journalctl logs and see: Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: Tsolo start lua Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: HTTP/1.0 301 Moved Permanently Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: Tsolo stop lua Apr 25 20:54:47 brp-ceph-cms1 radosgw[60901]: Lua INFO: Tsolo start lua Apr 25 20:54:48 brp-ceph-cms1 radosgw[60901]: Lua INFO: HTTP/1.0 301 Moved Permanently Apr 25 20:54:48 brp-ceph-cms1 radosgw[60901]: Lua INFO: Apr 25 20:54:48 brp-ceph-cms1 radosgw[60901]: Lua INFO: Tsolo stop lua So the script worked :) If you want to see where the luarocks libraries have been installed, look in the /tmp/ directory of the radosgw container after you redeploy it and you will find the content in /tmp/luarocks. Conclusions: There was a bit to figure out to get this working, but now that I've got this simple test working I think there is a lot more to look into and discover and use w.r.t. this powerful tool. Cheers, Tom

12 months

2
5
0 0

How can I use not-replicated pool (replication 1 or raid-0)

by mhnx

Hello. I have a 10 node cluster. I want to create a non-replicated pool (replication 1) and I want to ask some questions about it: Let me tell you my use case: - I don't care about losing data, - All of my data is JUNK and these junk files are usually between 1KB to 32MB. - These files will be deleted in 5 days. - Writable space and I/O speed is more important. - I have high Write/Read/Delete operations, minimum 200GB a day. I'm afraid that, in any failure, I won't be able to access the whole cluster. Losing data is okay but I have to ignore missing files, remove the data from the cluster and continue with existing data and while doing this, I want to be able to write new data to the cluster. My questions are: 1- To reach this goal do you have any recommendations? 2- With this setup, what potential problems do you have in mind? 3- I think Erasure Coding is not a choice because of the performance problems and slow file deletion. With this I/O need EC will miss files and leaks may happen (I've seen before on Nautilus). 4- You read my needs, is there a better way to do this? Thank you for the answers. Best regards.

12 months

3
8
0 0

2024

2023

2022

2021

2020

2019

ceph-users April 2023