All;
I turned on device health metrics in one of our Nautilus clusters. Unfortunately, it doesn't seem to be collecting any information.
When I do "ceph device get-health-metrics <device>, I get the following;
{
"20200821-223626": {
"dev": "/dev/sdc",
"error": "smartctl failed",
"nvme_smart_health_information_add_log_error": "nvme returned an error: sudo: exit status: 1",
"nvme_smart_health_information_add_log_error_code": -22,
"nvme_vendor": "samsung_ssd_860_evo_4tb",
"smartctl_error_code": -22,
"smartctl_output": "smartctl returned an error (1): stderr:\nsudo: exit status: 1\nstdout:\n"
}
}
The cluster is Nautilus 14.2.16 (updated from 14.2.11 just after turning on health metrics). Smartctl is release 7.0 dated 2018-12-30 at 14:47:55 UTC.
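In case it's relevant: my assumption (possibly wrong) is that the module runs smartctl via sudo as the OSD's user, given the "sudo: exit status: 1" in the output above, so I've been trying to reproduce it by hand on the OSD node with something like:
# run as the user the OSD runs under (usually "ceph"); -a and --json are smartctl 7.0 flags
sudo -u ceph sudo smartctl -a --json /dev/sdc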
Thoughts?
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos(a)PerformAir.com
www.PerformAir.com
Hi,
I have a case where the storage backend is about to change from OneFS to Ceph. Both mainly serve Windows clients as object storage.
Both ends have Samba configured as a gateway with AD integration. Both backends work and can be accessed and utilized, but the problem is the data transfer between them.
So far I have only tested robocopy, but that would be the preferred tool, since the transfer is done via a Windows machine mounting both source and target.
The copying itself works without problems, but copying the ACLs for the files and folders doesn't. If I run "robocopy source target /MIR /SEC", I get access denied; only the target top folder is created, and it has the wrong privileges.
CephFS should be correctly mounted via Samba, since I can create and copy files and folders when they are created by, in this case, the administrator. I can also change the permissions afterwards, but that is not how the copying should be done: there is a lot of data, and preserving the permissions is crucial.
I've tried many ways to preserve the permissions: besides robocopy, I've tried saving and restoring them with "icacls /save" and "icacls /restore", and with PowerShell commands as well. None of those seem to work.
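Roughly, the icacls attempt looked like this (share paths anonymized):
icacls \\oldgw\share\data /save acls.txt /t /c
icacls \\cephgw\share /restore acls.txt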
Here's a summary of what works:
Creating new folders/files as administrator on the source and robocopying them to the target with /MIR and /SEC.
Copying without /SEC.
Here's also what doesn't work or help:
Taking over the permissions as Administrator.
Removing non-existent users from the source folders/files.
Removing inheritance for the folder on the source.
Changing the owner to administrator.
The Ceph release is Nautilus.
Any ideas or suggestions on this topic?
Best regards and happy new year!
-Oskari
Hi,
All my OSD nodes in the SSD tier are randomly getting heartbeat_map
timeouts, and I can't find out why!
7ff2ed3f2700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread
0x7ff2c8943700' had timed out after 15
It occurs many times a day and causes downtime in my cluster.
Is there any way to find out why the OSDs time out? I don't think the
heartbeat mechanism itself is the problem; rather, some issue inside the
OSD causes the heartbeat to time out. The OSDs don't suicide, they just
get too slow, and that causes downtime on RBD and the S3 gateway because
the queue fills up!
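If it helps, I can pull the admin-socket dumps from an affected OSD, e.g. (osd.12 is just a placeholder):
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops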
Thanks.
Hi List,
In order to reproduce an issue we see on a production cluster (CephFS
client: ceph-fuse outperforms the kernel client by a factor of 5), we
would like a test cluster with the same cephfs "flags" as production.
However, it's not completely clear how certain features influence the
cephfs flags. What I could find in the source code, cephfs_features.h,
is that they *seem* to correspond to the Ceph release.
For example, CEPHFS_FEATURE_NAUTILUS gets "12" as its feature bit. An
upgraded (Luminous -> Mimic -> Nautilus) cephfs gives us the following
cephfs flags: "1c".
A (newly installed) Nautilus cluster gives "10" when new snapshots are
not allowed (ceph fs set cephfs allow_new_snaps false) and "12" when new
snapshots are allowed (ceph fs set cephfs allow_new_snaps true).
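My working guess (which may well be wrong) is that the reported value is a bitmask of the CEPH_MDSMAP_* flags from include/ceph_fs.h rather than the client feature bits, which would decode as:
0x12 = ALLOW_SNAPS (0x02) | ALLOW_MULTIMDS_SNAPS (0x10)
0x1c = ALLOW_MULTIMDS (0x04) | ALLOW_DIRFRAGS (0x08) | ALLOW_MULTIMDS_SNAPS (0x10)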
We would like to have the test cluster get the "1c" flags and see if we
can reproduce the issue. How can we achieve that?
Any info on how those cephfs flags are constructed is welcome.
Thanks,
Gr. Stefan
More banging on my prototype cluster, and I ran into an odd problem.
It used to be that when I created an rbd device and then tried to map it, the map would initially fail, saying I had to disable some features.
Then I would just run the suggested disable line -- usually
rbd feature disable poolname/rbdname object-map fast-diff deep-flatten
and then I could map it fine.
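In other words, the sequence that used to work was roughly (size is just an example):
rbd create testpool/zfs02 --size 100G
rbd feature disable testpool/zfs02 object-map fast-diff deep-flatten
rbd map testpool/zfs02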
But now, after the latest cluster re-creation, when I try to map, I just get:
# rbd map testpool/zfs02
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (110) Connection timed out
and no errors in dmesg output
If I try to disable those features anyway, I get:
librbd::Operations: one or more requested features are already disabled(22) Invalid argument
There is nothing in /var/log/ceph/cephadm.log either.
Any suggestions?
--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown(a)medata.com| www.medata.com
Hello,
TL;DR How can I recreate the device_health_metrics pool?
I'm experimenting with Ceph Octopus v15.2.8 in a 3 node cluster under
Proxmox 6.3. After initializing Ceph the usual way, a
"device_health_metrics" pool is created as soon as I create the first
manager. That pool has just 1 PG but no OSDs assigned, as no OSDs have
been created yet. After creating a few OSDs and waiting for a couple of
days, that PG is still in the "stale+undersized+peered" state.
So I thought I could just disable monitoring (ceph device monitoring
off), delete that pool and create it again with something like:
ceph osd pool create device_health_metrics 1 --autoscale-mode=off
The issue is that after recreating the pool and enabling monitoring
(ceph device monitoring on), I get no data stored in it regarding my
devices, even after running a manual scrape with ceph device
scrape-health-metrics.
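For what it's worth, these are roughly the checks I've run, and nothing relevant comes back for the pool contents:
ceph device ls
ceph device scrape-health-metrics
rados -p device_health_metrics ls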
Thank you in advance.
Victor
Hi,
I am trying to set up a new cluster with cephadm using a Docker backend.
The initial bootstrap did not finish cleanly; it errored out waiting for the mon IP. I used the command:
cephadm bootstrap --mon-ip 192.168.0.1
with 192.168.0.1 being the IP address of this first host.
I tried the command again, but it failed, as the new Ceph node was actually running, so it could not bind to the ports.
After a bit of searching I was able to use "sudo cephadm shell --" commands to change the username and password for the dashboard and log in to it.
I then used cephadm to add a new host with "sudo cephadm shell -- ceph orch host add host2".
Now in the inventory of the dashboard and in "ceph orch device ls", only devices on host2 are listed, not those on host1.
In the Cluster/Hosts section of the dashboard, host1 has its root volume drive listed under devices, and host2 has the root volume drive and the drive for the OSD listed.
I successfully added an OSD with a drive on host2; trying the same command adjusted for host1, I get the following in the log (the exact command is repeated after the log):
Dec 23 08:55:47 localhost systemd[1]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[24638]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost containerd[1470]: time="2020-12-23T08:55:47.369773808Z" level=info msg="shim containerd-shim started" address=/containerd-shim/80f876072532ebebdfef341a5c793654e27766f2d1708991a6f25599b24b6557.sock debug=false pid=28597
Dec 23 08:55:47 localhost bash[8745]: debug 2020-12-23T08:55:47.517+0000 ffff73d7a200 1 mon.host1(a)0(leader).osd e12 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 71303168 kv_alloc: 876609536
Dec 23 08:55:47 localhost containerd[1470]: time="2020-12-23T08:55:47.621748606Z" level=info msg="shim reaped" id=69a786e4a61605c1e6eca5a6e0e5ed0900635a214b0f1c96a4f26ea7911a12ff
Dec 23 08:55:47 localhost dockerd[2930]: time="2020-12-23T08:55:47.631479207Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 23 08:55:47 localhost systemd[24638]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[1]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[24638]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[1]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost containerd[1470]: time="2020-12-23T08:55:47.972437378Z" level=info msg="shim containerd-shim started" address=/containerd-shim/4a61d63e1f46722ffa7a950c31145d167c5c69087d003e5928a6aa3a4831f031.sock debug=false pid=28659
Dec 23 08:55:48 localhost bash[8745]: cluster 2020-12-23T08:55:46.892633+0000 mgr.host1.kkssvi (mgr.24098) 24278 : cluster [DBG] pgmap v24212: 1 pgs: 1 undersized+peered; 0 B data, 112 KiB used, 931 GiB / 932 GiB avail
Dec 23 08:55:48 localhost bash[8756]: debug 2020-12-23T08:55:48.889+0000 ffff93573700 0 log_channel(cluster) log [DBG] : pgmap v24213: 1 pgs: 1 undersized+peered; 0 B data, 112 KiB used, 931 GiB / 932 GiB avail
Dec 23 08:55:49 localhost bash[8756]: debug 2020-12-23T08:55:49.085+0000 ffff9056f700 0 log_channel(audit) log [DBG] : from='client.24206 -' entity='client.admin' cmd=[{"prefix": "orch daemon add osd", "svc_arg": "host1:/dev/nvme0n1", "target": ["mon-mgr", ""]}]: dispatch
Dec 23 08:55:49 localhost bash[8745]: debug 2020-12-23T08:55:49.085+0000 ffff71575200 0 mon.host1@0(leader) e2 handle_command mon_command({"prefix": "osd tree", "states": ["destroyed"], "format": "json"} v 0) v1
Dec 23 08:55:49 localhost bash[8745]: debug 2020-12-23T08:55:49.085+0000 ffff71575200 0 log_channel(audit) log [DBG] : from='mgr.24098 192.168.0.1:0/2486989775' entity='mgr.host1.kkssvi' cmd=[{"prefix": "osd tree", "states": ["destroyed"], "format": "json"}]: dispatch
Dec 23 08:55:49 localhost bash[8756]: debug 2020-12-23T08:55:49.089+0000 ffff8ed6d700 0 log_channel(cephadm) log [INF] : Found osd claims -> {}
Dec 23 08:55:49 localhost bash[8756]: debug 2020-12-23T08:55:49.089+0000 ffff8ed6d700 0 log_channel(cephadm) log [INF] : Found osd claims for drivegroup None -> {}
Dec 23 08:55:49 localhost containerd[1470]: time="2020-12-23T08:55:49.331868093Z" level=info msg="shim reaped" id=780a38dd49fce4a823c4c3d834abdd1cc17bbe0c0aa4f2dd7caeddf8dce1708e
Dec 23 08:55:49 localhost dockerd[2930]: time="2020-12-23T08:55:49.341765820Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 23 08:55:49 localhost systemd[24638]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895-merged.mount: Succeeded.
Dec 23 08:55:49 localhost systemd[1]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895-merged.mount: Succeeded.
Dec 23 08:55:49 localhost bash[8745]: audit 2020-12-23T08:55:49.091014+0000 mon.host1 (mon.0) 1093 : audit [DBG] from='mgr.24098 192.168.0.1:0/2486989775' entity='mgr.host1.kkssvi' cmd=[{"prefix": "osd tree", "states": ["destroyed"], "format": "json"}]: dispatch
Dec 23 08:55:50 localhost bash[8745]: cluster 2020-12-23T08:55:48.893433+0000 mgr.host1.kkssvi (mgr.24098) 24279 : cluster [DBG] pgmap v24213: 1 pgs: 1 undersized+peered; 0 B data, 112 KiB used, 931 GiB / 932 GiB avail
Dec 23 08:55:50 localhost bash[8745]: audit 2020-12-23T08:55:49.087597+0000 mgr.host1.kkssvi (mgr.24098) 24280 : audit [DBG] from='client.24206 -' entity='client.admin' cmd=[{"prefix": "orch daemon add osd", "svc_arg": "host1:/dev/nvme0n1", "target": ["mon-mgr", ""]}]: dispatch
Dec 23 08:55:50 localhost bash[8745]: cephadm 2020-12-23T08:55:49.093552+0000 mgr.host1.kkssvi (mgr.24098) 24281 : cephadm [INF] Found osd claims -> {}
Dec 23 08:55:50 localhost bash[8745]: cephadm 2020-12-23T08:55:49.093933+0000 mgr.host1.kkssvi (mgr.24098) 24282 : cephadm [INF] Found osd claims for drivegroup None -> {}
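For clarity, the command that produced the audit entries above was roughly:
sudo cephadm shell -- ceph orch daemon add osd host1:/dev/nvme0n1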
The other problem is that logging is set to debug for both hosts. I tried "sudo cephadm shell -- ceph daemon mon.host1 config set mon_cluster_log_file_level info", which reports success, but logging remains at the debug level.
If I try the same command with mon.host2, I get:
INFO:cephadm:Inferring fsid ae111111-1111-1111-1111-f1111a11111a
INFO:cephadm:Inferring config /var/lib/ceph/ae147088-4486-11eb-9044-f1337a55707a/mon.host1/config
INFO:cephadm:Using recent ceph image ceph/ceph:v15
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Which looks like it is trying to use the config for host1 on host2?
Thanks,
Duncan
Dear ceph folks,
rbd_cache can be set up as a read/write cache for librbd, and is widely used with OpenStack Cinder. Does krbd have a similar cache control mechanism or not? I am using krbd as the backend storage for iSCSI and NFS, and I wonder whether a cache setting exists for krbd.
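For reference, on the librbd side I mean the kind of ceph.conf settings below (values are just examples):
[client]
rbd cache = true
rbd cache size = 33554432      # 32 MiB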
thanks in advance,
Samuel
huxiaoyu(a)horebdata.cn