Hi,
How do I set the correct URL to Grafana in a
new cephadm bootstrapped cluster?
When I try to access the performance parts of the
Ceph dashboard my browser tells me that it cannot
resolve the short hostname that is presented in the
URL to Grafana.
cephadm seems to use only the short hostname and not the FQDN, which is
needed when accessing the Dashboard from the browser (at least in this
setup).
This obviously did not work:
root@ceph01:~# ceph dashboard get-grafana-api-url
https://ceph01:3000
root@ceph01:~# ceph dashboard set-grafana-api-url https://ceph01.ceph.heinlein-akademie.de:3000
Option GRAFANA_API_URL updated
root@ceph01:~# ceph config get mgr mgr/dashboard/GRAFANA_API_URL
https://ceph01:3000
root@ceph01:~# ceph config set mgr mgr/dashboard/GRAFANA_API_URL https://ceph01.ceph.heinlein-akademie.de:3000
root@ceph01:~# ceph config get mgr mgr/dashboard/GRAFANA_API_URL
https://ceph01:3000
root@ceph01:~# ceph dashboard get-grafana-api-url
https://ceph01:3000
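One thing I still plan to try (untested, in case the dashboard module or
cephadm is caching or re-applying the old value) is to set the URL again
and then fail over the active mgr:

root@ceph01:~# ceph dashboard set-grafana-api-url https://ceph01.ceph.heinlein-akademie.de:3000
root@ceph01:~# ceph mgr fail <active-mgr-name>
root@ceph01:~# ceph config dump | grep -i grafana
root@ceph01:~# ceph dashboard get-grafana-api-url

Whether cephadm resets the value again on its next reconfigure is just a
guess on my part.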
Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Hi
My colleagues want to use Ceph RGW to store Elasticsearch backups and
Nexus blobs.
But the services cannot connect to RGW over the S3 protocol when I point
them at the frontend nginx address (virtual IP). Only when they use the
backend RGW's address (real IP) do Elasticsearch and Nexus work with RGW.
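In case the proxy setup matters, this is roughly what the nginx frontend
looks like (simplified; host names and IPs are placeholders, and the Host
header handling is the part I'm least sure about):

upstream rgw_backend {
    server 10.0.0.11:7480;   # real IP of one RGW
    server 10.0.0.12:7480;
}

server {
    listen 80;
    server_name s3.example.com;   # the name / virtual IP the clients are given

    location / {
        proxy_pass http://rgw_backend;
        # without this RGW sees nginx's upstream name instead of the name the
        # client used, which can break virtual-hosted-style S3 requests
        # (rgw_dns_name mismatch)
        proxy_set_header Host $host;
    }
}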
Has anyone met the same issue?
Thanks
Hi all,
We had some stuck MDS ops this morning on a 14.2.11 CephFS cluster. I
tried to ls the path from another client and that blocked.
The ops were like this:
# egrep 'desc|flag|age' ops.txt
"description": "client_request(client.1212755100:37475
lookup #0x1003e229d38/analytics-logs 2020-09-03 04:30:45.607640
caller_uid=2004, caller_gid=2004{})",
"age": 4975.7069706660004,
"flag_point": "failed to authpin, subtree is being exported",
"description": "client_request(client.1212755100:37477
lookup #0x1003e229d38/bundled-plugins 2020-09-03 04:31:01.064591
caller_uid=2004, caller_gid=2004{})",
"age": 4960.2499044630003,
"flag_point": "failed to authpin, subtree is being exported",
...
The full list of stuck ops is at https://termbin.com/8itv
We had only just yesterday enabled 2 active MDSs on this cluster.
I don't have much info about this client, other than that it has been
running for several weeks successfully until we enabled the 2nd active
mds.
To clear out the slow ops we evicted client.1212755100, and then we
reduced back to 1 active MDS so it doesn't happen again.
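If we ever go back to 2 active MDSs, the current plan (untested, and the
path below is just an example) is to pin the busy directories to one rank
so their subtrees don't get exported mid-lookup, and to watch the subtree
map while doing so:

# pin this tree to rank 0 so it is never migrated
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/analytics-logs

# on the MDS host, see which subtrees are being exported and what is stuck
ceph daemon mds.<name> get subtrees
ceph daemon mds.<name> dump_ops_in_flight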
Any ideas what might have triggered this?
Cheers, Dan
Hi,
The cluster I'm writing about has a long history (months) of instability
mainly related to large RocksDB database and high memory consumption.
The use-case is RGW with an EC8+3 pool for data.
In the last months this cluster has been suffering from OSDs using much
more memory than osd_memory_target, mainly allocated in buffer_anon.
After removing a lot of data from the cluster and re-installing all OSDs
there is one thing remaining: High memory usage when *NOT* writing data
to the cluster.
There is a script running which keeps writing data to RADOS at a slow
pace. Once this stops we observe the memory usage of the OSDs grow
steadily and also see the RocksDB databases of the BlueStore OSDs grow.
Once we start to write again, the memory usage (buffer_anon) shrinks.
I think this is related to the pglogs, but even trimming all the pglogs
does not solve this issue.
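For reference, the buffer_anon numbers above come from the OSD mempool
dump; this is what I'm watching while the writes are stopped (osd.0 is
just an example):

# breakdown per mempool (buffer_anon, bluestore_cache_*, osd_pglog, ...)
ceph daemon osd.0 dump_mempools

# tcmalloc heap statistics for the same OSD
ceph tell osd.0 heap stats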
Has anybody seen this before or has any clues where to start looking?
Ceph version 14.2.8
Wido
Hi,
I've had a complete monitor failure, which I have recovered from with the steps here: https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-mon/…
The data and metadata pools are there and completely intact, but Ceph is reporting that there are no filesystems, whereas before the failure there was one.
Is there any way of putting the filesystem back together again without having to resort to rebuilding the metadata pool with cephfs-data-scan?
I'm on ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
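What I was considering (but have not run, as I'm unsure whether it is safe
on intact pools) is recreating the filesystem entry on top of the existing
pools and then resetting it, roughly like this, with my real metadata/data
pool names in place of the placeholders:

# untested: re-register the fs using the existing pools
ceph fs new cephfs cephfs_metadata cephfs_data --force
# untested: reset the MDS map state of the recovered fs
ceph fs reset cephfs --yes-i-really-mean-it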
Thanks,
Harlan
Hi,
Could someone help me figure out what the issue is with our deployment steps, please?
Initial RGW Cluster_1
===================================================================================
ADD_RGW_TO_CLUSTER

Create Default Realm
- sudo radosgw-admin realm create --rgw-realm=default --default

Create Default Zone Group
- sudo radosgw-admin zonegroup create --rgw-zonegroup=default --master --default

Create Default Zone
- sudo radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=default --master --default

Update Period
- sudo radosgw-admin period update --rgw-realm=default --commit

Create RGW on 3 OSD_Node
- sudo ceph orch apply rgw default default --placement="3 bk-otpsmon-1001 bk-otpsmon-1002 bk-otpsmon-1003"
===================================================================================

Initial RGW Cluster_2
- sudo ceph orch apply rgw default default --placement="3 hy-otpsmon-2001 hy-otpsmon-2002 bk-otpsmon-2003"
HOW_TO_CONFIGURE_CEPH_OCTOPUS_MULTISITE
===================================================================================================
configure multisite
radosgw-admin realm create \
--rgw-realm=agoda \
--default
radosgw-admin zonegroup create \
--rgw-zonegroup=data \
--endpoints=http://bk-otpsmon-1001:80,http://bk-otpsmon-1002:80,http://bk-otpsmon-1003:80 \
--rgw-realm=agoda \
--master \
--default
***** BUG Run Command-line => Can't send short command-line *****
radosgw-admin zone create --rgw-zonegroup=data --rgw-zone=bk --master --default --endpoints=http://bk-otpsmon-1001:80,http://bk-otpsmon-1002:80,http://bk-otpsmon-1003:80
***** BUG Can't Delete Pool *****
***** ceph config set mon mon_allow_pool_delete true
***** ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'
***** (optional) systemctl |grep -i ceph
***** (optional) systemctl restart ceph-85b0c358-e2a2-11ea-8864-000c29fa922a@mon.bk-otpsmon-1001.service
# radosgw-admin zonegroup remove --rgw-zonegroup=default --rgw-zone=default
# radosgw-admin period update --commit
# radosgw-admin zone delete --rgw-zone=default
# radosgw-admin period update --commit
# radosgw-admin zonegroup delete --rgw-zonegroup=default
# radosgw-admin period update --commit
# ceph osd pool rm default.rgw.meta default.rgw.meta --yes-i-really-really-mean-it
# ceph osd pool rm default.rgw.control default.rgw.control --yes-i-really-really-mean-it
# ceph osd pool rm default.rgw.log default.rgw.log --yes-i-really-really-mean-it
radosgw-admin user create --uid="ceph-sync" --display-name="ceph-sync" --system
"user": "ceph-sync",
"access_key": "VWV6566957QGVDV6ITJM",
"secret_key": "qAEGXslUHBeWv7O6VMCmdo0z2AgMyBZlcKqg38H7"
radosgw-admin zone modify \
--rgw-zone=bk \
--access-key=VWV6566957QGVDV6ITJM \
--secret=qAEGXslUHBeWv7O6VMCmdo0z2AgMyBZlcKqg38H7
radosgw-admin period update --rgw-realm=agoda --commit
Configure Rados Gateway Client at RGW Nodes
[client.rgw.bk-otpsmon-1001]
host = bk-otpsmon-1001
rgw frontends = "civetweb port=80"
rgw_zone=bk
[client.rgw.bk-otpsmon-1002]
host = bk-otpsmon-1002
rgw frontends = "civetweb port=80"
rgw_zone=bk
[client.rgw.bk-otpsmon-1003]
host = bk-otpsmon-1003
rgw frontends = "civetweb port=80"
rgw_zone=bk
Restart All RGW Container Node
rgw_1: systemctl restart ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1001.atdfmv.service
rgw_2: systemctl restart ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1002.wkieqj.service
rgw_3: systemctl restart ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1003.jpyzdq.service
Enable All RGW Container Node
rgw_1: systemctl enable ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1001.atdfmv.service
rgw_2: systemctl enable ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1002.wkieqj.service
rgw_3: systemctl enable ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1003.jpyzdq.service
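(I'm not sure the ceph.conf snippets above are actually picked up by the
cephadm containers; to double-check which zone and period the gateways
ended up in after the restart I was planning to run something like:)

# radosgw-admin zone get --rgw-zone=bk
# radosgw-admin period get-current
# radosgw-admin sync status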
=====================================
Secondary Zones
radosgw-admin realm pull --url={url-to-master-zone-gateway} --access-key={access-key} --secret={secret}
radosgw-admin realm pull --url=http://bk-otpsmon-1001:80 --access-key=CL7NF0DLYL7D2YYVR9HA --secret=C49mBXDNgHl9fNwibdaQamvffB9QM2RNj5snUq03
radosgw-admin realm pull --url=http://bk-otpsmon-1001:80 --access-key=CL7NF0DLYL7D2YYVR9HA --secret=C49mBXDNgHl9fNwibdaQamvffB9QM2RNj5snUq03 --rgw-realm=agoda
And here are the errors:
Error From Cluster_2 After Realm Pull
[root@hy-otpsmon-2001 ~]# radosgw-admin realm pull --url=http://bk-otpsmon-1001:80 --access-key=CL7NF0DLYL7D2YYVR9HA --secret=C49mBXDNgHl9fNwibdaQamvffB9QM2RNj5snUq03 --rgw-realm=agoda
request failed: (13) Permission denied
If the realm has been changed on the master zone, the master zone's gateway may need to be restarted to recognize this user.
[root@hy-otpsmon-2001 ~]#
Error From Cluster_1 RGW Container Log
Sep 02 11:20:52 bk-otpsmon-1001 bash[1246]: debug 2020-09-02T04:20:52.963+0000 7f3b0747f700 1 ====== starting new request req=0x7f3b455118a0 =====
Sep 02 11:20:52 bk-otpsmon-1001 bash[1246]: debug 2020-09-02T04:20:52.967+0000 7f3b0747f700 1 op->ERRORHANDLER: err_no=-2028 new_err_no=-2028
Sep 02 11:20:52 bk-otpsmon-1001 bash[1246]: debug 2020-09-02T04:20:52.967+0000 7f3b0747f700 1 ====== req done req=0x7f3b455118a0 op status=0 http_status=403 latency=0.002999559s ======
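One thing I noticed while writing this up: the realm pull uses a different
access key (CL7NF...) than the ceph-sync system user created above
(VWV65...). What I plan to check next (not sure it is the actual cause) is
the system user's keys on the master, restarting the master gateways as
the error message suggests, and retrying the pull with exactly those keys:

# on cluster_1: confirm the system user and its current keys
radosgw-admin user info --uid=ceph-sync

# restart the master zone gateways so they pick up the realm/period changes
systemctl restart ceph-89c72a6c-eb95-11ea-b88b-000c29147836@rgw.default.default.bk-otpsmon-1001.atdfmv.service

# on cluster_2: retry the pull with the keys reported by 'user info'
radosgw-admin realm pull --url=http://bk-otpsmon-1001:80 --access-key=<access_key> --secret=<secret_key> --rgw-realm=agoda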
All help appreciated.
Hi,
I'd like to gain a better understanding about what operations emit which
of these performance counters, in particular when is 'op_rw' incremented
instead of 'op_r' + 'op_w'?
I've done a little bit of investigation (v12.2.13), running various
workloads and operations against an RBD volume (in a cluster with no
other client activity):
- Most RBD 'operations' (create, rm, features disable/enable, map,
unmap) emit 'op_rw' and often 'op_w' too
- Program reads and writes against a mounted RBD volume *only* emit
'op_r' and 'op_w' (never 'op_rw'), regardless of whether they are 'read
+ modify' of existing file data (or whether the writes are buffered,
direct or sync)
Is that correct? Or have I missed a program driven workload that will
produce 'op_rw'? [1]
In our production clusters I'm seeing similar numbers of 'op_w' and
'op_rw' (for a given OSD), which would imply a lot of RBD management
operations if those are the only thing that increments 'op_rw'.
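For reference, I'm reading the counters straight from the OSD admin
socket, roughly like this (osd.0 is just an example):

# snapshot of the op counters for one OSD
ceph daemon osd.0 perf dump osd | grep -E '"op_(r|w|rw)":'

# descriptions of what each counter is supposed to mean
ceph daemon osd.0 perf schema

My working assumption is that 'op_rw' counts single client ops that
contain both read and write sub-operations (e.g. object-class calls such
as the ones RBD management issues), which would fit the pattern above,
but I'd like confirmation.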
Cheers
Mark
[1] Tested using fio and pgbench (database benchmark). I mounted the
volume using the kernel driver (I'll do some more experimentation using
librbd)
Hi,
I have set up a 3-host cluster with 30 OSDs total. The cluster has health OK and no warnings whatsoever. I set up an RBD pool and 14 images, which were all rbd-mirrored to a second cluster (disconnected since the problems began), and also an iSCSI interface. Then I connected a Windows 2019 Server through iSCSI, mounted all 14 drives and created a spanned volume with all the drives. Everything was working fine, but I had to disconnect the server, so I disconnected the iSCSI interface, and when I tried to reconnect the volume was unusable and the drives seemed stuck. I ended up rebooting each cluster node and then later, since I still couldn't use my images, removed and recreated all images.
In this second run all was good: I had a robocopy job syncing files to my Ceph cluster for almost a week and had already copied more than 5TB of data when my Windows Server got stuck. I'm still not sure why it got stuck; some services like FTP were responding but others, including login, were not. So I reset the Windows server, and when it was back up my spanned volume was bad again. I've been trying to recover it for the last 2 days without success.
Right now all images are disconnected, I have no locks (I found some at one point and removed them, but I'm not sure who was holding them) and no watchers on any of the images, but the 3 images that had data in them are corrupt or locked somehow. Nothing I try works on them; the operations just get stuck. I can edit the images' config, but not for these 3. I can create snapshots, but not for these 3. I managed to mount images using iSCSI on a Linux box, but these 3 leave Linux commands (fdisk, parted) hanging. The Ceph dashboard shows stats like read and write rates for all images except these 3.
It seems something inside these images is broken or stuck, but as I said there are no locks on them.
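For the record, this is how I've been checking for leftover clients on the
3 bad images (pool/image names are placeholders):

# watchers on the image
rbd status rbd_pool/image01

# any remaining advisory locks
rbd lock ls rbd_pool/image01

# clients that may have been blacklisted during the resets
ceph osd blacklist ls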
I tried a lot of options, and somehow my cluster now has some RGW pools, and I have no idea where they came from.
Any idea what I should do?
--
Salsa