good afternoon,
how can I configure the ceph.conf file on a generic RBD client so that it
uses two different Ceph clusters to access different volumes on each?
ceph-cluster-left --> rbd-vol-green
ceph-cluster-right --> rbd-vol-blue
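For context, the layout I have in mind would be something like this (cluster
and pool names are just placeholders, and I am not sure it is the intended way):
# one config file and keyring per cluster on the client:
#   /etc/ceph/left.conf    /etc/ceph/left.client.admin.keyring
#   /etc/ceph/right.conf   /etc/ceph/right.client.admin.keyring
# then select the cluster per command, e.g.:
rbd --cluster left  -p pool-green ls
rbd --cluster right -p pool-blue  ls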
thank you.
Hi Shashi,
I just ran into this myself, and I thought I'd share the
solution/workaround that I applied.
On 15/05/2023 22:08, Shashi Dahal wrote:
> Hi,
> I followed this documentation:
>
> https://docs.ceph.com/en/pacific/cephadm/adoption/
>
> This is the error I get when trying to enable cephadm.
>
> ceph mgr module enable cephadm
>
> Error ENOENT: module 'cephadm' reports that it cannot run on the active
> manager daemon: loading remoto library:No module named 'remoto' (pass
> --force to force enablement)
>
> When I import remoto, it imports just fine.
>
>
> OS is ubuntu 20.04 focal
As far as I can see, this issue applies to non-containerized Ceph
Pacific deployments — such as ones orchestrated with ceph-ansible —
running on Debian or Ubuntu. There is no python3-remoto package on those
platforms, so you can't install remoto by "regular" installation means
(that is, apt/apt-get).
It looks to me like this issue was introduced in Pacific, and then went
away in Quincy because that release dropped remoto and replaced it with
asyncssh (for which a Debian/Ubuntu package does exist). If you start
out on Octopus with ceph-ansible and do the Cephadm migration *then*,
you're apparently fine too, and you can subsequently use Cephadm to
upgrade to Pacific and Quincy. I think it's just this particular
combination — (a) run on Debian/Ubuntu, (b) deploy non-containerized,
*and* (c) start your deployment on Pacific, where Cephadm adoption breaks.
The problem has apparently been known for a while (see
https://tracker.ceph.com/issues/43415), but the recommendation appears
to have been "just run mgr on a different OS then", which is frequently
not a viable option.
I tried (like you did, I assume) to just pip-install remoto, and if I
opened a Python console and typed "import remoto" it imported just fine,
but apparently the cephadm mgr module didn't like that.
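(A quick way to see where pip actually put the module, for comparison with the
sys.path the mgr computes below, is something like:
python3 -c 'import remoto; print(remoto.__file__)'
which on Focal prints a path under /usr/local/lib/python3.8/dist-packages/.)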
I've now traced this down to the following line that shows up in the
ceph-mgr log if you bump "debug mgr" to 10/10:
2023-06-26T10:01:34.799+0000 7fb0979ba500 10 mgr[py] Computed sys.path
'/usr/share/ceph/mgr:/local/lib/python3.8/dist-packages:/lib/python3/dist-packages:/lib/python3.8/dist-packages:lib/python38.zip:/lib/python3.8:/lib/python3.8/lib-dynload'
Note the /local/lib/python3.8/dist-packages path, which does not exist
on Ubuntu Focal. It's properly /usr/local/lib/python3.8/dist-packages,
and this is where "pip install", when run as root outside a virtualenv,
installs packages to.
I think the incorrect sys.path may actually be a build or packaging bug
in the community packages built for Debian/Ubuntu, but I'm not 100% certain.
At any rate, the combined workaround for this issue, for me, is:
(1) pip install remoto (this installs remoto into
/usr/local/lib/python3.8/dist-packages)
(2) ln -s /usr/local/lib/python3.8/dist-packages
/local/lib/python3.8/dist-packages (this makes pip-installed packages
available to ceph-mgr)
(3) restart all ceph-mgr instances
(4) ceph mgr module enable cephadm
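For convenience, the same workaround as a single snippet (the python3.8 paths
match Ubuntu Focal, so adjust for other Python versions; the mgr unit id is
usually the hostname):
pip install remoto
ln -s /usr/local/lib/python3.8/dist-packages /local/lib/python3.8/dist-packages
systemctl restart ceph-mgr@<id>.service    # repeat on every mgr host
ceph mgr module enable cephadm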
Cheers,
Florian
Hello,
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
There is a single (test) radosgw serving plenty of test traffic. When under heavy req/s ("heavy" in a low sense, about 1k rq/s) it pretty reliably hangs: low-traffic threads seem to work (like handling occasional PUTs), but GETs are completely unresponsive; all attention seems to be spent on futexes.
The effect is extremely similar to
https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-ar… (subject: Radosgw (civetweb) hangs once around)
except this is quincy so it's beast instead of civetweb. The effect is the same as described there, except the cluster is way smaller (about 20-40 OSDs).
I observed that when I start radosgw -f with debug 20/20 it almost never hangs, so my guess is some ugly race condition. However, I am a bit clueless about how to actually debug it, since turning debugging up makes it go away. Debug 1 (the default) with -d seems to hang after a while, but it's not that simple to induce; I'm still testing under 4/4.
Also I do not see much to configure about beast.
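The only beast-related knob I am aware of is the thread pool (as far as I know,
beast takes its worker-thread count from rgw_thread_pool_size), e.g.:
ceph config set client.rgw rgw_thread_pool_size 1024   # default is 512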
To answer the questions from the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart from the one serving occasional PUTs
- tcp port doesn't respond.
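For the record, one way to capture the state of all threads during a hang
without touching the debug levels is something like this (assuming gdb and
debug symbols are available; attaching briefly stops the process):
gdb -p <radosgw-pid> -batch -ex 'thread apply all bt' > rgw-hang-backtraces.txt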
IRC didn't react. ;-)
Thanks,
Peter
hi everybody,
we have a problem with the NFS Ganesha load balancer.
When we use rsync -avre to copy files from another share to the Ceph NFS share
path, we get this error:
`rsync -rav /mnt/elasticsearch/newLogCluster/acr-202*
/archive/Elastic-v7-archive`
rsync: close failed on "/archive/Elastic-v7-archive/....":
Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(586) [Receiver=3.1.3]
We use ingress for load balancing the NFS service, and no other problems are
observed in the cluster.
Below is information about the pool, volume path and quota
------------
10.20.32.161:/volumes/arch-1/arch   30T  5.0T   26T  17%  /archive
# ceph osd pool get-quota arch-bigdata-data
quotas for pool 'arch-bigdata-data':
max objects: N/A
max bytes : 30 TiB (current num bytes: 5488192308978 bytes)
---------------
# ceph fs subvolume info arch-bigdata arch arch-1
{
    "atime": "2023-06-11 13:32:22",
    "bytes_pcent": "16.64",
    "bytes_quota": 32985348833280,
    "bytes_used": 5488566602388,
    "created_at": "2023-06-11 13:32:22",
    "ctime": "2023-06-25 10:45:35",
    "data_pool": "arch-bigdata-data",
    "features": [
        "snapshot-clone",
        "snapshot-autoprotect",
        "snapshot-retention"
    ],
    "gid": 0,
    "mode": 16877,
    "mon_addrs": [
        "10.20.32.153:6789",
        "10.20.32.155:6789",
        "10.20.32.154:6789"
    ],
    "mtime": "2023-06-25 10:38:48",
    "path": "/volumes/arch-1/arch/f246a31b-7103-41b9-8005-63d00efe88e4",
    "pool_namespace": "",
    "state": "complete",
    "type": "subvolume",
    "uid": 0
}
Has anyone ever experienced this error? What would you suggest to solve it?
Hi,
I am getting many critical alerts in the Ceph dashboard, while the cluster
shows HEALTH_OK status.
See the attached screenshot for details. My questions are: are these real
alerts, and how do I get rid of them?
Thanks
Ben
Hi all
I have a Ceph cluster consisting of two zonegroups with metadata syncing
enabled. I need to change the owner of a bucket that is located in the
secondary zonegroup.
I followed the steps below:
1. Unlinked the bucket from the old user on the secondary zonegroup:
   $ radosgw-admin bucket unlink --uid OLD_UID -b test-change-owner
2. Linked the bucket to the new user on the secondary zonegroup:
   $ radosgw-admin bucket link --uid NEW_UID -b test-change-owner
3. Changed the owner of the bucket on the primary (master) zonegroup:
   $ radosgw-admin bucket chown --uid NEW_UID -b test-change-owner
After executing the last command on the primary zonegroup, the bucket owner
was successfully changed. However, the ownership of the objects within the
bucket still remains with the old user.
When I executed the same radosgw-admin bucket chown command on the
secondary zonegroup, I received a warning about inconsistent metadata
between zones, but the bucket owner was changed successfully on the
secondary zonegroup.
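For what it's worth, the per-zone view of the bucket metadata and the sync
state can be checked with something like the following (run in each zonegroup):
radosgw-admin metadata get bucket:test-change-owner   # compare the owner field between zones
radosgw-admin sync status                             # overall multisite sync state
radosgw-admin metadata sync status                    # metadata sync detail (on the secondary)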
My questions are:
What is the best way to change the owner of a bucket in a multi-zonegroup
cluster?
What are the potential impacts of running the chown command on the
secondary zonegroup? Is it possible to have inconsistent metadata between
zones in this case?
Hello,
we removed some nodes from our cluster. This worked without problems.
Now, lots of OSDs do not want to join the cluster anymore if we reboot
one of the still available nodes.
It always runs into timeouts:
--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300
MONs and MGRs are running fine.
The network is working; netcat shows the MONs' ports are open.
Setting a higher debug level has no effect even if we add it to the
ceph.conf file.
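For completeness, two quick checks on an affected host, since cephx is
time-sensitive and the hunting timeout suggests the OSDs never complete
authentication (run these where the OSD's ceph.conf is visible, e.g. inside
the container):
ceph-conf --name osd.XX --show-config-value mon_host   # which mons is the OSD configured to use?
chronyc tracking                                       # clock skew can also break cephx; or: timedatectl status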
The PGs are pretty unhappy, e.g. PG 7.143 (state down, up/acting [209,NONE,NONE], primary 209):
7.143  87771  0  0  0  0  314744902235  0  0  10081  10081  down
    2023-06-20T09:16:03.546158+0000  961275'1395646  961300:9605547
    [209,NONE,NONE]  209  [209,NONE,NONE]  209
    961231'1395512  2023-06-19T23:46:40.101791+0000
    961231'1395512  2023-06-19T23:46:40.101791+0000
PG query wants us to mark an OSD as lost; however, I do not want to do this.
OSDs are blocked by OSDs from the removed nodes:
ceph osd blocked-by
osd num_blocked
152 38
244 41
144 54
...
We added the removed hosts again and tried to start the OSDs on those nodes,
but they also ran into the timeout mentioned above.
This is a containerized cluster running version 16.2.10.
Replication is 3, some pools use an erasure coded profile.
Best regards,
Malte
Hi
we have a brand new Ceph instance deployed by the ceph puppet module.
We are experiencing a funny issue: user caps change unexpectedly.
The logs do not report any message about the user caps, even with auth/debug_auth set to 5/5.
Who/what can change the caps?
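One thing I have not checked yet: if the change is made via the CLI/API, it
should show up in the mon audit log together with the entity that issued it,
something like (path assumes default file logging, not containers/journald):
grep -E 'auth (add|caps|import|get-or-create)' /var/log/ceph/ceph.audit.log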
thanks in advance
Ale
root@cephmon1:~# ceph version
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@cephmon1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
root@cephmon1:~#
root@cephmon1:~# ceph auth del client.cinder-backup
updated
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~# date
Fri Jun 23 07:52:40 AM CEST 2023
root@cephmon1:~#
root@cephmon1:~# ceph auth add client.cinder-backup
added key for client.cinder-backup
root@cephmon1:~# ceph auth caps client.cinder-backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=backup' mgr 'profile rbd pool=backup'
updated caps for client.cinder-backup
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~# date
Fri Jun 23 07:53:18 AM CEST 2023
root@cephmon1:~# ceph auth list
client.cinder-backup
key: AQBEM5VkhIfJHBAA6WP9P3HHCTSySdTqZv4Ypg==
caps: [mgr] profile rbd pool=backup
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=backup
root@cephmon1:~# ceph auth list
client.cinder-backup
key: AQAM0nJi8OYAFBAA+p+T2QWtKaq92Z/hFMgF4w==
caps: [mgr] profile rbd pool=backups
caps: [mon] profile rbd
caps: [osd] profile rbd pool=backups
root@cephmon1:~# date
Fri Jun 23 07:56:42 AM CEST 2023
Hi Eugene,
Thank you for your response, here is the update.
The upgrade to Quincy was done following the cephadm orch upgrade procedure:
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6
The upgrade completed without errors. After the upgrade, upon creating the Grafana service from the Ceph dashboard, it deployed Grafana 6.7.4. The version is hardcoded in the code; should it not be 8.3.5, as listed in the Quincy documentation quoted below?
[screenshot: Grafana service started from the Ceph dashboard]
Quincy documentation states: https://docs.ceph.com/en/latest/releases/quincy/
……documentation snippet
Monitoring and alerting:
43 new alerts have been added (totalling 68) improving observability of events affecting: cluster health, monitors, storage devices, PGs and CephFS.
Alerts can now be sent externally as SNMP traps via the new SNMP gateway service (the MIB is provided).
Improved integrated full/nearfull event notifications.
Grafana Dashboards now use grafonnet format (though they’re still available in JSON format).
Stack update: images for monitoring containers have been updated. Grafana 8.3.5, Prometheus 2.33.4, Alertmanager 0.23.0 and Node Exporter 1.3.1. This reduced exposure to several Grafana vulnerabilities (CVE-2021-43798, CVE-2021-39226, CVE-2021-43798, CVE-2020-29510, CVE-2020-29511).
………………….
I notice that the versions of the rest of the monitoring stack that the Ceph dashboard deploys are also older than what is documented: Prometheus 2.7.2, Alertmanager 0.16.2 and Node Exporter 0.17.0.
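As far as I understand, the monitoring images cephadm uses can be overridden
via mgr config options and a redeploy, e.g. (the image tag below is the one
the Quincy release notes mention; I have not verified that this is the
intended fix):
ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:8.3.5
ceph orch redeploy grafana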
Also, the Grafana 6.7.4 service reports a few warnings, highlighted below:
root@fl31ca104ja0201:/home/general# systemctl status ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service
● ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service - Ceph grafana.fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
Loaded: loaded (/etc/systemd/system/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-06-13 03:37:58 UTC; 11h ago
Main PID: 391896 (bash)
Tasks: 53 (limit: 618607)
Memory: 17.9M
CGroup: /system.slice/system-ceph\x2dd0a3b6e0\x2dd2c3\x2d11ed\x2dbe05\x2da7a3a1d7a87e.slice/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104j>
├─391896 /bin/bash /var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana.fl31ca104ja0201/unit.run
└─391969 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --init --name ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-grafana-fl>
-- Logs begin at Sun 2023-06-11 20:41:51 UTC, end at Tue 2023-06-13 15:35:12 UTC. --
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="alter user_auth.auth_id to length 190"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth access token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth refresh token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth token type to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth expiry to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add index to user_id column in user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="create server_lock table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add index server_lock.operation_uid"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="create user auth token table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index user_auth_token.auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index user_auth_token.prev_auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="create cache_data table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index cache_data.cache_key"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Created default organization" logger=sqlstore
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing HTTPServer" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing BackendPluginManager" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing PluginManager" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Starting plugin search" logger=plugins
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing HooksService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing OSSLicensingService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing InternalMetricsService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing RemoteCache" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing RenderingService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing AlertEngine" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing QuotaService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing ServerLockService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing UserAuthTokenService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing DatasourceCacheService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing LoginService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing SearchService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing TracingService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing UsageStatsService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing CleanUpService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing NotificationService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing provisioningServiceImpl" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=warn msg="[Deprecated] the datasource provisioning config is outdated. please upgrade" logger=provisioning.datasources filename=/etc/grafana/provisioning/datasources/ceph-dashboard.yml
This warning is due to the missing "apiVersion: 1" first-line entry in /etc/grafana/provisioning/datasources/ceph-dashboard.yml created by cephadm.
If the file is modified to include the apiVersion line and the Grafana service is restarted, the warning goes away.
Is this a known ISSUE?
Here is the content of the ceph-dashboard.yml produced by cephadm
deleteDatasources:
  - name: 'Dashboard1'
    orgId: 1
  - name: 'Loki'
    orgId: 2

datasources:
  - name: 'Dashboard1'
    type: 'prometheus'
    access: 'proxy'
    orgId: 1
    url: 'http://fl31ca104ja0201.xxx.xxx.com:9095'
    basicAuth: false
    isDefault: true
    editable: false
  - name: 'Loki'
    type: 'loki'
    access: 'proxy'
    orgId: 2
    url: ''
    basicAuth: false
    isDefault: true
    editable: false
--------------------------------------------------------------
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=Dashboard1
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=Loki
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Backend rendering via phantomJS" logger=rendering renderer=phantomJS
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=warn msg="phantomJS is deprecated and will be removed in a future release. You should consider migrating from phantomJS to grafana-image-renderer plugin. Read more at https://grafana.com/docs/grafana/latest/administration/image_rendering/" logger=rendering renderer=phantomJS
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing Stream Manager"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=[::]:3000 protocol=https subUrl= socket=
I also had to change a few other things to keep all the services running. The last issue that I have not been able to resolve yet is that the Ceph dashboard gives the error shown in the screenshot below, even though Grafana is running on the same server. However, the Grafana dashboard cannot be accessed without tunnelling.
[screenshot of the dashboard error attached]
Hello guys,
We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows
clients.
We started noticing some unexpected performance issues with iSCSI: an SSD
pool reaches about 100 MB/s of write speed for an image via the iSCSI
gateway, while the same image can reach 600+ MB/s of write speed when
mounted and consumed directly via RBD.
Is that performance degradation expected? We would expect some degradation,
but not this much.
Also, we have a question regarding the use of Intel Turbo Boost. Should we
disable it? Is it possible that the root cause of the slowness in the iSCSI
GW is the Intel Turbo Boost feature, which reduces the clock of some cores?
Any feedback is much appreciated.