good afternoon,
how can I configure the ceph.conf file on a generic RBD client so that it
uses two different Ceph clusters to access different volumes on each?
ceph-cluster-left --> rbd-vol-green
ceph-cluster-right --> rbd-vol-blue
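For context, the layout I have in mind would be something like this (cluster
and pool names are just placeholders, and I am not sure it is the intended way):
# one config file and keyring per cluster on the client:
#   /etc/ceph/left.conf    /etc/ceph/left.client.admin.keyring
#   /etc/ceph/right.conf   /etc/ceph/right.client.admin.keyring
# then select the cluster per command, e.g.:
rbd --cluster left  -p pool-green ls
rbd --cluster right -p pool-blue  ls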
thank you.
Hi Shashi,
I just ran into this myself, and I thought I'd share the
solution/workaround that I applied.
On 15/05/2023 22:08, Shashi Dahal wrote:
> Hi,
> I followed this documentation:
>
> https://docs.ceph.com/en/pacific/cephadm/adoption/
>
> This is the error I get when trying to enable cephadm.
>
> ceph mgr module enable cephadm
>
> Error ENOENT: module 'cephadm' reports that it cannot run on the active
> manager daemon: loading remoto library:No module named 'remoto' (pass
> --force to force enablement)
>
> When I import remoto, it imports just fine.
>
>
> OS is ubuntu 20.04 focal
As far as I can see, this issue applies to non-containerized Ceph
Pacific deployments — such as ones orchestrated with ceph-ansible —
running on Debian or Ubuntu. There is no python3-remoto package on those
platforms, so you can't install remoto by "regular" installation means
(that is, apt/apt-get).
It looks to me like this issue was introduced in Pacific, and then went
away in Quincy because that release dropped remoto and replaced it with
asyncssh (for which a Debian/Ubuntu package does exist). If you start
out on Octopus with ceph-ansible and do the Cephadm migration *then*,
you're apparently fine too, and you can subsequently use Cephadm to
upgrade to Pacific and Quincy. I think it's just this particular
combination — (a) run on Debian/Ubuntu, (b) deploy non-containerized,
*and* (c) start your deployment on Pacific, where Cephadm adoption breaks.
The problem has apparently been known for a while (see
https://tracker.ceph.com/issues/43415), but the recommendation appears
to have been "just run mgr on a different OS then", which is frequently
not a viable option.
I tried (like you did, I assume) to just pip-install remoto, and if I
opened a Python console and typed "import remoto" it imported just fine,
but apparently the cephadm mgr module didn't like that.
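(A quick way to see where pip actually put the module, for comparison with the
sys.path the mgr computes below, is something like:
python3 -c 'import remoto; print(remoto.__file__)'
which on Focal prints a path under /usr/local/lib/python3.8/dist-packages/.)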
I've now traced this down to the following line that shows up in the
ceph-mgr log if you bump "debug mgr" to 10/10:
2023-06-26T10:01:34.799+0000 7fb0979ba500 10 mgr[py] Computed sys.path
'/usr/share/ceph/mgr:/local/lib/python3.8/dist-packages:/lib/python3/dist-packages:/lib/python3.8/dist-packages:lib/python38.zip:/lib/python3.8:/lib/python3.8/lib-dynload'
Note the /local/lib/python3.8/dist-packages path, which does not exist
on Ubuntu Focal. It's properly /usr/local/lib/python3.8/dist-packages,
and this is where "pip install", when run as root outside a virtualenv,
installs packages to.
I think the incorrect sys.path may actually be a build or packaging bug
in the community packages built for Debian/Ubuntu, but I'm not 100% certain.
At any rate, the combined workaround for this issue, for me, is:
(1) pip install remoto (this installs remoto into
/usr/local/lib/python3.8/dist-packages)
(2) ln -s /usr/local/lib/python3.8/dist-packages
/local/lib/python3.8/dist-packages (this makes pip-installed packages
available to ceph-mgr)
(3) restart all ceph-mgr instances
(4) ceph mgr module enable cephadm
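For convenience, the same workaround as a single snippet (the python3.8 paths
match Ubuntu Focal, so adjust for other Python versions; the mgr unit id is
usually the hostname):
pip install remoto
ln -s /usr/local/lib/python3.8/dist-packages /local/lib/python3.8/dist-packages
systemctl restart ceph-mgr@<id>.service    # repeat on every mgr host
ceph mgr module enable cephadm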
Cheers,
Florian
Hello,
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
There is a single (test) radosgw serving plenty of test traffic. When under heavy req/s ("heavy" in a low sense, about 1k rq/s) it pretty reliably hangs: low-traffic threads seem to work (like handling occasional PUTs), but GETs are completely unresponsive; all attention seems to be spent on futexes.
The effect is extremely similar to
https://ceph-users.ceph.narkive.com/I4uFVzH9/radosgw-civetweb-hangs-once-ar… (subject: Radosgw (civetweb) hangs once around)
except this is quincy so it's beast instead of civetweb. The effect is the same as described there, except the cluster is way smaller (about 20-40 OSDs).
I observed that when I start radosgw -f with debug 20/20 it almost never hangs, so my guess is some ugly race condition. However, I am a bit clueless about how to actually debug it, since turning debugging up makes it go away. Debug 1 (the default) with -d seems to hang after a while, but it's not that simple to induce; I'm still testing under 4/4.
Also I do not see much to configure about beast.
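The only beast-related knob I am aware of is the thread pool (as far as I know,
beast takes its worker-thread count from rgw_thread_pool_size), e.g.:
ceph config set client.rgw rgw_thread_pool_size 1024   # default is 512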
To answer the questions from the original (2016) thread:
- Debian stable
- no visible limits issue
- no obvious memory leak observed
- no other visible resource shortage
- strace says everyone's waiting on futexes, about 600-800 threads, apart from the one serving occasional PUTs
- tcp port doesn't respond.
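For the record, one way to capture the state of all threads during a hang
without touching the debug levels is something like this (assuming gdb and
debug symbols are available; attaching briefly stops the process):
gdb -p <radosgw-pid> -batch -ex 'thread apply all bt' > rgw-hang-backtraces.txt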
IRC didn't react. ;-)
Thanks,
Peter
hi everybody,
we have a problem with the NFS Ganesha load balancer.
When we use rsync -avre to copy files from another share to the Ceph NFS share
path, we get this error:
`rsync -rav /mnt/elasticsearch/newLogCluster/acr-202*
/archive/Elastic-v7-archive`
rsync: close failed on "/archive/Elastic-v7-archive/....":
Input/output error (5)
rsync error: error in file IO (code 11) at receiver.c(586) [Receiver=3.1.3]
We use ingress for load balancing the NFS service, and no other problems are
observed in the cluster.
Below is information about the pool, volume path and quota
------------
10.20.32.161:/volumes/arch-1/arch   30T  5.0T   26T  17%  /archive
# ceph osd pool get-quota arch-bigdata-data
quotas for pool 'arch-bigdata-data':
max objects: N/A
max bytes : 30 TiB (current num bytes: 5488192308978 bytes)
---------------
# ceph fs subvolume info arch-bigdata arch arch-1
{
    "atime": "2023-06-11 13:32:22",
    "bytes_pcent": "16.64",
    "bytes_quota": 32985348833280,
    "bytes_used": 5488566602388,
    "created_at": "2023-06-11 13:32:22",
    "ctime": "2023-06-25 10:45:35",
    "data_pool": "arch-bigdata-data",
    "features": [
        "snapshot-clone",
        "snapshot-autoprotect",
        "snapshot-retention"
    ],
    "gid": 0,
    "mode": 16877,
    "mon_addrs": [
        "10.20.32.153:6789",
        "10.20.32.155:6789",
        "10.20.32.154:6789"
    ],
    "mtime": "2023-06-25 10:38:48",
    "path": "/volumes/arch-1/arch/f246a31b-7103-41b9-8005-63d00efe88e4",
    "pool_namespace": "",
    "state": "complete",
    "type": "subvolume",
    "uid": 0
}
Has anyone ever experienced this error? What would you suggest to solve it?
Hi,
I am getting many critical alerts in the Ceph dashboard, while the cluster
shows HEALTH_OK status.
See the attached screenshot for details. My questions are: are these real
alerts, and how do I get rid of them?
Thanks
Ben
Hi all
I have a Ceph cluster consisting of two zonegroups with metadata syncing
enabled. I need to change the owner of a bucket that is located in the
secondary zonegroup.
I followed the steps below:
1. Unlinked the bucket from the old user on the secondary zonegroup:
   $ radosgw-admin bucket unlink --uid OLD_UID -b test-change-owner
2. Linked the bucket to the new user on the secondary zonegroup:
   $ radosgw-admin bucket link --uid NEW_UID -b test-change-owner
3. Changed the owner of the bucket on the primary (master) zonegroup:
   $ radosgw-admin bucket chown --uid NEW_UID -b test-change-owner
After executing the last command on the primary zonegroup, the bucket owner
was successfully changed. However, the ownership of the objects within the
bucket still remains with the old user.
When I executed the same radosgw-admin bucket chown command on the
secondary zonegroup, I received a warning about inconsistent metadata
between zones, but the bucket owner was changed successfully on the
secondary zonegroup.
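For what it's worth, the per-zone view of the bucket metadata and the sync
state can be checked with something like the following (run in each zonegroup):
radosgw-admin metadata get bucket:test-change-owner   # compare the owner field between zones
radosgw-admin sync status                             # overall multisite sync state
radosgw-admin metadata sync status                    # metadata sync detail (on the secondary)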
My questions are:
What is the best way to change the owner of a bucket in a multi-zonegroup
cluster?
What are the potential impacts of running the chown command on the
secondary zonegroup? Is it possible to have inconsistent metadata between
zones in this case?
Hello,
we removed some nodes from our cluster. This worked without problems.
Now, lots of OSDs do not want to join the cluster anymore if we reboot
one of the still available nodes.
It always runs into timeouts:
--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300
MONs and MGRs are running fine.
The network is working; netcat shows the MONs' ports are open.
Setting a higher debug level has no effect even if we add it to the
ceph.conf file.
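For completeness, two quick checks on an affected host, since cephx is
time-sensitive and the hunting timeout suggests the OSDs never complete
authentication (run these where the OSD's ceph.conf is visible, e.g. inside
the container):
ceph-conf --name osd.XX --show-config-value mon_host   # which mons is the OSD configured to use?
chronyc tracking                                       # clock skew can also break cephx; or: timedatectl status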
The PGs are pretty unhappy, e.g. PG 7.143 (state down, up/acting [209,NONE,NONE], primary 209):
7.143  87771  0  0  0  0  314744902235  0  0  10081  10081  down
    2023-06-20T09:16:03.546158+0000  961275'1395646  961300:9605547
    [209,NONE,NONE]  209  [209,NONE,NONE]  209
    961231'1395512  2023-06-19T23:46:40.101791+0000
    961231'1395512  2023-06-19T23:46:40.101791+0000
PG query wants us to mark an OSD as lost; however, I do not want to do this.
OSDs are blocked by OSDs from the removed nodes:
ceph osd blocked-by
osd num_blocked
152 38
244 41
144 54
...
We added the removed hosts again and tried to start the OSDs on those nodes,
but they also ran into the timeout mentioned above.
This is a containerized cluster running version 16.2.10.
Replication is 3, some pools use an erasure coded profile.
Best regards,
Malte
Hi
we have a brand new Ceph instance deployed by the ceph puppet module.
We are experiencing a funny issue: user caps change unexpectedly.
The logs do not report any message about the user caps, even with auth/debug_auth set to 5/5.
Who/what can change the caps?
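One thing I have not checked yet: if the change is made via the CLI/API, it
should show up in the mon audit log together with the entity that issued it,
something like (path assumes default file logging, not containers/journald):
grep -E 'auth (add|caps|import|get-or-create)' /var/log/ceph/ceph.audit.log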
thanks in advance
Ale
root@cephmon1:~# ceph version
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@cephmon1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
root@cephmon1:~#
root@cephmon1:~# ceph auth del client.cinder-backup
updated
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~# date
Fri Jun 23 07:52:40 AM CEST 2023
root@cephmon1:~#
root@cephmon1:~# ceph auth add client.cinder-backup
added key for client.cinder-backup
root@cephmon1:~# ceph auth caps client.cinder-backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=backup' mgr 'profile rbd pool=backup'
updated caps for client.cinder-backup
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~#
root@cephmon1:~# date
Fri Jun 23 07:53:18 AM CEST 2023
root@cephmon1:~# ceph auth list
client.cinder-backup
key: AQBEM5VkhIfJHBAA6WP9P3HHCTSySdTqZv4Ypg==
caps: [mgr] profile rbd pool=backup
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=backup
root@cephmon1:~# ceph auth list
client.cinder-backup
key: AQAM0nJi8OYAFBAA+p+T2QWtKaq92Z/hFMgF4w==
caps: [mgr] profile rbd pool=backups
caps: [mon] profile rbd
caps: [osd] profile rbd pool=backups
root@cephmon1:~# date
Fri Jun 23 07:56:42 AM CEST 2023
Hi Eugene,
Thank you for your response, here is the update.
The upgrade to Quincy was done following the cephadm orch upgrade procedure:
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6
The upgrade completed without errors. After the upgrade, upon creating the Grafana service from the Ceph dashboard, it deployed Grafana 6.7.4. The version is hardcoded in the code; should it not be 8.3.5, as listed in the Quincy documentation quoted below?
[screenshot: Grafana service started from the Ceph dashboard]
Quincy documentation states: https://docs.ceph.com/en/latest/releases/quincy/
……documentation snippet
Monitoring and alerting:
43 new alerts have been added (totalling 68) improving observability of events affecting: cluster health, monitors, storage devices, PGs and CephFS.
Alerts can now be sent externally as SNMP traps via the new SNMP gateway service (the MIB is provided).
Improved integrated full/nearfull event notifications.
Grafana Dashboards now use grafonnet format (though they’re still available in JSON format).
Stack update: images for monitoring containers have been updated. Grafana 8.3.5, Prometheus 2.33.4, Alertmanager 0.23.0 and Node Exporter 1.3.1. This reduced exposure to several Grafana vulnerabilities (CVE-2021-43798, CVE-2021-39226, CVE-2021-43798, CVE-2020-29510, CVE-2020-29511).
………………….
I notice that the versions of the rest of the monitoring stack that the Ceph dashboard deploys are also older than what is documented: Prometheus 2.7.2, Alertmanager 0.16.2 and Node Exporter 0.17.0.
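As far as I understand, the monitoring images cephadm uses can be overridden
via mgr config options and a redeploy, e.g. (the image tag below is the one
the Quincy release notes mention; I have not verified that this is the
intended fix):
ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:8.3.5
ceph orch redeploy grafana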
Also, the Grafana 6.7.4 service reports a few warnings, highlighted below:
root@fl31ca104ja0201:/home/general# systemctl status ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service
● ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service - Ceph grafana.fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
Loaded: loaded (/etc/systemd/system/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2023-06-13 03:37:58 UTC; 11h ago
Main PID: 391896 (bash)
Tasks: 53 (limit: 618607)
Memory: 17.9M
CGroup: /system.slice/system-ceph\x2dd0a3b6e0\x2dd2c3\x2d11ed\x2dbe05\x2da7a3a1d7a87e.slice/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104j>
├─391896 /bin/bash /var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana.fl31ca104ja0201/unit.run
└─391969 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --init --name ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-grafana-fl>
-- Logs begin at Sun 2023-06-11 20:41:51 UTC, end at Tue 2023-06-13 15:35:12 UTC. --
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="alter user_auth.auth_id to length 190"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth access token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth refresh token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth token type to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add OAuth expiry to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="Add index to user_id column in user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="create server_lock table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add index server_lock.operation_uid"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="create user auth token table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index user_auth_token.auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index user_auth_token.prev_auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="create cache_data table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Executing migration" logger=migrator id="add unique index cache_data.cache_key"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Created default organization" logger=sqlstore
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing HTTPServer" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing BackendPluginManager" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing PluginManager" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Starting plugin search" logger=plugins
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing HooksService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing OSSLicensingService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing InternalMetricsService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing RemoteCache" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing RenderingService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing AlertEngine" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing QuotaService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing ServerLockService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing UserAuthTokenService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing DatasourceCacheService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing LoginService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing SearchService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing TracingService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing UsageStatsService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing CleanUpService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing NotificationService" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing provisioningServiceImpl" logger=server
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=warn msg="[Deprecated] the datasource provisioning config is outdated. please upgrade" logger=provisioning.datasources filename=/etc/grafana/provisioning/datasources/ceph-dashboard.yml
This warning is due to the missing "apiVersion: 1" first-line entry in /etc/grafana/provisioning/datasources/ceph-dashboard.yml created by cephadm.
If the file is modified to include the apiVersion line and the Grafana service is restarted, the warning goes away.
Is this a known ISSUE?
Here is the content of the ceph-dashboard.yml produced by cephadm
deleteDatasources:
  - name: 'Dashboard1'
    orgId: 1
  - name: 'Loki'
    orgId: 2

datasources:
  - name: 'Dashboard1'
    type: 'prometheus'
    access: 'proxy'
    orgId: 1
    url: 'http://fl31ca104ja0201.xxx.xxx.com:9095'
    basicAuth: false
    isDefault: true
    editable: false
  - name: 'Loki'
    type: 'loki'
    access: 'proxy'
    orgId: 2
    url: ''
    basicAuth: false
    isDefault: true
    editable: false
--------------------------------------------------------------
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=Dashboard1
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="inserting datasource from configuration " logger=provisioning.datasources name=Loki
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Backend rendering via phantomJS" logger=rendering renderer=phantomJS
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=warn msg="phantomJS is deprecated and will be removed in a future release. You should consider migrating from phantomJS to grafana-image-renderer plugin. Read more at https://grafana.com/docs/grafana/latest/administration/image_rendering/" logger=rendering renderer=phantomJS
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="Initializing Stream Manager"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+0000 lvl=info msg="HTTP Server Listen" logger=http.server address=[::]:3000 protocol=https subUrl= socket=
I also had to change a few other things to keep all the services running. The last issue that I have not been able to resolve yet is that the Ceph dashboard gives the error shown in the screenshot below, even though Grafana is running on the same server. However, the Grafana dashboard cannot be accessed without tunnelling.
[screenshot of the dashboard error attached]
Hello guys,
We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows
clients.
We started noticing some unexpected performance issues with iSCSI: an SSD
pool reaches about 100 MB/s of write speed for an image via the iSCSI
gateway, while the same image can reach 600+ MB/s of write speed when
mounted and consumed directly via RBD.
Is that performance degradation expected? We would expect some degradation,
but not this much.
Also, we have a question regarding the use of Intel Turbo Boost. Should we
disable it? Is it possible that the root cause of the slowness in the iSCSI
GW is the Intel Turbo Boost feature, which reduces the clock of some cores?
Any feedback is much appreciated.