Hello,
We are running Mimic 13.2.8 on our cluster, and since upgrading to 13.2.8
the Prometheus plugin seems to hang a lot. It used to respond in under 10 s,
but now it often hangs. Restarting the mgr processes helps temporarily, but
within minutes it gets stuck again.
The active mgr doesn't exit when running `systemctl stop ceph-mgr.target` and
needs to be kill -9'ed.
Is there anything I can do to address this, or at least get better
visibility into the issue?
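Things I'm planning to try for more visibility (a rough sketch only; 9283 is the prometheus module's default port and woodenbox2 is our current active mgr, adjust as needed):
  # time the exporter directly to see when/where it stalls
  time curl -s http://woodenbox2:9283/metrics > /dev/null
  # temporarily raise mgr verbosity via the admin socket on the active mgr
  ceph daemon mgr.woodenbox2 config set debug_mgr 10
  # dump the mgr's perf counters while it is stuck
  ceph daemon mgr.woodenbox2 perf dump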
We only have a few plugins enabled:
$ ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "prometheus",
        "zabbix"
    ],
We have 3 mgr processes, but it's a pretty large cluster (nearly 4000 OSDs) and
a busy one, with lots of rebalancing. (I don't know whether a busy cluster would
seriously affect the mgr's performance, but I'm throwing it out there.)
  services:
    mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
    mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
    mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
    osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
    rgw: 4 daemons active
Thanks in advance for your help,
-Paul Choi
Hi guys,
Creating a second filesystem is documented as an experimental feature, but the documentation doesn't explain how to ensure that MDS affinity sticks to the second filesystem you create. Has anyone had success implementing a second CephFS? In my case it will be based on a completely different pool from my first one.
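For context, here is roughly what I am attempting (the pool names, PG counts and the mds.b daemon name below are placeholders of mine, and the affinity knobs seem to differ by release):
  ceph fs flag set enable_multiple true --yes-i-really-mean-it
  ceph osd pool create cephfs2_metadata 32
  ceph osd pool create cephfs2_data 128
  ceph fs new cephfs2 cephfs2_metadata cephfs2_data
  # affinity: pre-Octopus via mds_standby_for_fscid / mds_standby_for_name in the
  # daemon's config section; Octopus and later via e.g.
  #   ceph config set mds.b mds_join_fs cephfs2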
Thanks.
J
It works well for me; I've been running a couple of clusters for 1-2 years where all OSD hosts (~200) have no system disks and instead netboot via PXE.
No NFS server is involved: each host loads the same system image (a Debian Live squashfs) into memory on boot and runs independently from there on out. It takes some trickery to configure and bring the OSDs up on boot (using Puppet in my case), though that might get easier with the containerized approach in Ceph 15+.
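Roughly, the OSD bring-up part boils down to something like this on each boot (a sketch only, assuming LVM-based OSDs and that ceph.conf plus the keyrings have already been dropped in place by the config management):
  ceph-volume lvm activate --all
wired into a systemd unit or Puppet-triggered step that runs once per boot.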
Best,
Eric
> On 21 Mar 2020, at 14:18, huxiaoyu(a)horebdata.cn wrote:
>
> Hi, Marc,
>
> Indeed, PXE boot makes a lot of sense in a large cluster, cutting down the OS deployment and management burden, but only if the absence of a single point of failure is guaranteed...
>
> best regards,
>
> samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: Marc Roos
> Date: 2020-03-21 14:13
> To: ceph-users; huxiaoyu; martin.verges
> Subject: RE: [ceph-users] Questions on Ceph cluster without OS disks
>
> I would say it is not a 'proven technology', otherwise you would see
> widespread implementation and adoption of this method. However, if you
> really need the physical disk space, it is a solution. Although I also
> would have questions about creating an extra redundant environment to
> serve remote booting, just to spare an OS disk slot. Maybe this
> makes more sense in really big environments.
>
>
>
>
>
> -----Original Message-----
> From: huxiaoyu(a)horebdata.cn [mailto:huxiaoyu@horebdata.cn]
> Sent: 21 March 2020 13:54
> To: Martin Verges; ceph-users
> Subject: [ceph-users] Questions on Ceph cluster without OS disks
>
> Hello, Martin,
>
> I notice that Croit advocates running Ceph clusters without OS disks,
> booting via PXE instead.
>
> Do you use an NFS server to serve the root file system for each node,
> e.g. hosting configuration files, users and passwords, log files, etc.?
> My question is: would the NFS server be a single point of failure? If the
> NFS server goes down or the network experiences an outage, Ceph nodes may
> not be able to write to their local file systems, possibly leading to a
> service outage.
>
> How do you deal with the above potential issues in production? I am a
> bit worried...
>
> best regards,
>
> samuel
>
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi,
I'm using Intel Optane disks to provide WAL/DB capacity for my Ceph cluster (which is part of Proxmox, for VM hosting).
I've read that WAL/DB partitions only use either 3 GB, 30 GB, or 300 GB, due to the way RocksDB works.
Is this true?
My current WAL/DB partition is 145 GB - does this mean that 115 GB of it will be permanently wasted?
Is this behaviour documented somewhere, or is there some background, so I can understand a bit more about how it works?
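For what it's worth, the rough arithmetic behind those numbers, assuming the RocksDB defaults BlueStore ships with (max_bytes_for_level_base around 256 MB, level multiplier 10):
  L1 ~ 0.25 GB
  L2 ~ 2.5 GB
  L3 ~ 25 GB
  L4 ~ 250 GB
A level only stays on the fast device if the DB partition can hold it together with all smaller levels, so the useful break points land near 0.25 + 2.5 ≈ 3 GB, + 25 ≈ 30 GB, + 250 ≈ 300 GB (plus some WAL and overhead). Treat these as rough figures rather than exact thresholds.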
Thanks,
Victor
How do I get rid of this logging?
Mar 31 13:40:03 c01 ceph-mgr: 2020-03-31 13:40:03.521 7f554edc8700 0 log_channel(cluster) log [DBG] : pgmap v672067: 384 pgs: 384 active+clean;
Hello,
I have configured a multisite Ceph setup.
The master zone has not changed, but on the destination zone I had some
problems.
On the destination zone I cleaned up and reinstalled the radosgw, but trying
to assign the same zone name it had before the reinstallation does not work
(radosgw does not start).
I changed its zone name and radosgw now starts, but the master is still using
the old name when I try to execute:
radosgw-admin period update commit
Please, how can I solve this problem?
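In case it helps, this is the direction I was thinking of going on the master side (zone and zonegroup names are placeholders; please double-check before running, as zone delete is destructive):
  radosgw-admin zonegroup remove --rgw-zonegroup=<zonegroup> --rgw-zone=<old-zone-name>
  radosgw-admin zone delete --rgw-zone=<old-zone-name>
  radosgw-admin period update --commit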
Thanks
Ignazi
Hi all,
I'm running a 3-node Ceph cluster at home with collocated mons and MDS daemons, currently serving 3 filesystems, and have been since Mimic. I'm planning to go down to one FS and use RBD in the future, but that is another story. I'm using the cluster as cold storage on spindles with EC pools for archival purposes, and it usually does not run 24/7. I actually managed to upgrade to Octopus without problems yesterday. So first of all: great job with the release.
Now I have a little problem and a general question to address.
I have tried to share the CephFS via Samba and the vfs_ceph module, but I could not manage to get write access to the share (read access is not a problem), even with the admin key. When I instead share the mounted path (kernel or FUSE mount) as usual, there are no problems at all. Is vfs_ceph generally read-only and I missed that point? Furthermore, I suppose there is no way to choose between the different MDS namespaces (filesystems), right?
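For reference, a sketch of the kind of share definition I mean (the share name, the 'samba' cephx user and the paths are only examples):
  [archive]
      path = /
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      ceph:user_id = samba
      kernel share modes = no
      read only = no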
Now the general question. Since the cluster does not run 24/7 as stated and is turned on perhaps once a week for a couple of hours on demand, what are reasonable settings for the scrubbing intervals? As I said, the storage is cold and there is mostly read I/O. The archiving process adds approximately 0.5% of the cluster's total storage capacity in new data.
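To make the question concrete, these are the knobs I am looking at (the values below are purely illustrative, not what I currently use):
  ceph config set osd osd_scrub_min_interval 86400      # 1 day
  ceph config set osd osd_scrub_max_interval 604800     # 1 week
  ceph config set osd osd_deep_scrub_interval 2419200   # 4 weeks
Or would it make more sense to leave the defaults and trigger scrubs manually (e.g. ceph osd deep-scrub <osd-id>) while the cluster happens to be powered on?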
Stay healthy and regards,
Marco Savoca
Hi all,
I am installing Ceph Nautilus and constantly getting errors while adding iSCSI gateways.
It was working using the http scheme, but after moving to https with wildcard certs it gives API errors.
Below are some of my configurations.
Thanks for your help.
Command:
ceph --cluster ceph dashboard iscsi-gateway-add https://myadmin:admin.01@1.2.3.4:5050
Error:
Error EINVAL: iscsi REST API cannot be reached. Please check your configuration and that the API endpoint is accessible
Tried also disabling ssl verify
# ceph dashboard set-rgw-api-ssl-verify False
Option RGW_API_SSL_VERIFY updated
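Note that set-rgw-api-ssl-verify only affects the RGW API; the iSCSI API has its own toggle in the Nautilus dashboard (worth confirming against your exact minor release):
  ceph dashboard set-iscsi-api-ssl-verification false
Also, with api_secure = True the gateway has to be reachable over https with a certificate that matches the address used; a wildcard cert will not match a bare IP like 1.2.3.4, which may be the sticking point here.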
"/etc/ceph/iscsi-gateway.cfg" 23L, 977C
# Ansible managed
[config]
api_password = admin.01
api_port = 5050
# API settings.
# The API supports a number of options that allow you to tailor it to your
# local environment. If you want to run the API under https, you will need to
# create cert/key files that are compatible for each iSCSI gateway node, that is
# not locked to a specific node. SSL cert and key files *must* be called
# 'iscsi-gateway.crt' and 'iscsi-gateway.key' and placed in the '/etc/ceph/' directory
# on *each* gateway node. With the SSL files in place, you can use 'api_secure = true'
# to switch to https mode.
# To support the API, the bare minimum settings are:
api_secure = True
# Optional settings related to the CLI/API service
api_user = myadmin
cluster_name = ceph
loop_delay = 1
trusted_ip_list = 1.2.3.3,1.2.3.4
Log file
======
ceph-rgw-cnode04.rgw0.log
2020-03-30 10:24:20.392 7f6a2dc1b700 1 ====== req done req=0x561d9ce465f0 op status=0 http_status=200 latency=0.0119993s ======
2020-03-30 10:24:20.394 7f6a2cc19700 1 ====== starting new request req=0x561d9ce465f0 =====
2020-03-30 10:24:20.396 7f6a2cc19700 1 ====== req done req=0x561d9ce465f0 op status=0 http_status=404 latency=0.00199988s ======
2020-03-30 10:24:20.397 7f6a2bc17700 1 ====== starting new request req=0x561d9ce465f0 =====
2020-03-30 10:24:20.410 7f6a2bc17700 1 ====== req done req=0x561d9ce465f0 op status=0 http_status=200 latency=0.0129992s ======
2020-03-30 10:24:20.499 7f6a27c0f700 1 ====== starting new request req=0x561d9cec25f0 =====
2020-03-30 10:24:20.502 7f6a27c0f700 1 ====== req done req=0x561d9cec25f0 op status=0 http_status=200 latency=0.00299982s ======
2020-03-30 10:24:20.504 7f6a2740e700 1 ====== starting new request req=0x561d9cec25f0 =====
2020-03-30 10:24:20.506 7f6a2740e700 1 ====== req done req=0x561d9cec25f0 op status=0 http_status=200 latency=0.00199988s ======
2020-03-30 10:24:30.516 7f6a22404700 1 ====== starting new request req=0x561d9cf825f0 =====
2020-03-30 10:24:30.518 7f6a22404700 1 ====== req done req=0x561d9cf825f0 op status=0 http_status=200 latency=0.00199988s ======
2020-03-30 10:24:30.620 7f6a1ebfd700 1 ====== starting new request req=0x561d9cf925f0 =====
2020-03-30 10:24:30.622 7f6a1ebfd700 1 ====== req done req=0x561d9cf925f0 op status=0 http_status=200 latency=0.00199988s ======
2020-03-30 10:24:30.708 7f6a19bf3700 1 ====== starting new request req=0x561d9cfd45f0 =====
2020-03-30 10:24:30.708 7f6a193f2700 1 ====== starting new request req=0x561d9cfaa5f0 =====
2020-03-30 10:24:30.710 7f6a19bf3700 1 ====== req done req=0x561d9cfd45f0 op status=0 http_status=200 latency=0.00199988s ======
2020-03-30 10:24:30.711 7f6a193f2700 1 ====== req done req=0x561d9cfaa5f0 op status=0 http_status=200 latency=0.00299982s ======
/ceph-rgw-cnode05.rgw0.log
2020-03-30 10:07:41.309 7fb79d31c700 1 ====== req done http_status=400 ======
2020-03-30 10:07:41.505 7fb798312700 1 ====== starting new request req=0x5565d88b45f0 =====
2020-03-30 10:07:41.508 7fb798312700 1 ====== req done req=0x5565d88b45f0 op status=0 http_status=200 latency=0.00299982s ======
2020-03-30 10:07:41.531 7fb79430a700 1 failed to read header: bad method
2020-03-30 10:07:41.531 7fb79430a700 1 ====== req done http_status=400 ======
2020-03-30 10:07:41.552 7fb791304700 1 failed to read header: bad method
2020-03-30 10:07:41.552 7fb791304700 1 ====== req done http_status=400 ======
Hello List,
is this a bug?
root@ceph02:~# ceph cephadm generate-key
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1413, in _generate_key
    with open(path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp4ejhr7wh/key'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 110, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 308, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 72, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 63, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1418, in _generate_key
    os.unlink(path)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp4ejhr7wh/key'
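One thing I still need to check (an assumption on my part, not something I have verified in the module source): if _generate_key creates the key in that temp directory by shelling out to ssh-keygen, a missing openssh-client on the mgr host would leave the file absent and produce exactly this FileNotFoundError. A quick check:
  # assumption: the mgr module relies on the ssh-keygen binary being present
  which ssh-keygen || apt install openssh-client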
root@ceph02:~# dpkg -l |grep ceph
ii  ceph-base                      15.2.0-1~bpo10+1  amd64  common ceph daemon libraries and management tools
ii  ceph-common                    15.2.0-1~bpo10+1  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-deploy                    2.0.1             all    Ceph-deploy is an easy to use configuration tool
ii  ceph-mds                       15.2.0-1~bpo10+1  amd64  metadata server for the ceph distributed file system
ii  ceph-mgr                       15.2.0-1~bpo10+1  amd64  manager for the ceph distributed storage system
ii  ceph-mgr-cephadm               15.2.0-1~bpo10+1  all    cephadm orchestrator module for ceph-mgr
ii  ceph-mgr-dashboard             15.2.0-1~bpo10+1  all    dashboard module for ceph-mgr
ii  ceph-mgr-diskprediction-cloud  15.2.0-1~bpo10+1  all    diskprediction-cloud module for ceph-mgr
ii  ceph-mgr-diskprediction-local  15.2.0-1~bpo10+1  all    diskprediction-local module for ceph-mgr
ii  ceph-mgr-k8sevents             15.2.0-1~bpo10+1  all    kubernetes events module for ceph-mgr
ii  ceph-mgr-modules-core          15.2.0-1~bpo10+1  all    ceph manager modules which are always enabled
ii  ceph-mgr-rook                  15.2.0-1~bpo10+1  all    rook module for ceph-mgr
ii  ceph-mon                       15.2.0-1~bpo10+1  amd64  monitor server for the ceph storage system
ii  ceph-osd                       15.2.0-1~bpo10+1  amd64  OSD server for the ceph storage system
ii  cephadm                        15.2.0-1~bpo10+1  amd64  cephadm utility to bootstrap ceph daemons with systemd and containers
ii  libcephfs1                     10.2.11-2         amd64  Ceph distributed file system client library
ii  libcephfs2                     15.2.0-1~bpo10+1  amd64  Ceph distributed file system client library
ii  python-ceph-argparse           14.2.8-1          all    Python 2 utility libraries for Ceph CLI
ii  python3-ceph-argparse          15.2.0-1~bpo10+1  all    Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common            15.2.0-1~bpo10+1  all    Python 3 utility libraries for Ceph
ii  python3-cephfs                 15.2.0-1~bpo10+1  amd64  Python 3 libraries for the Ceph libcephfs library
root@ceph02:~# cat /etc/debian_version
10.3
Thanks,
Michael