Hi,
why would ceph osd df show a smaller number in the SIZE field than the disk actually has:
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA    OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
85  hdd    0.89999  1.00000   100 GiB  96 GiB   95 GiB  289 KiB  952 MiB  4.3 GiB  95.68  3.37   10  up
Instead of 100 GiB it should show 5.5 TiB.
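A first check might be to compare what the OSD itself reports for its block device with the real disk size (a sketch; /dev/sdX is a placeholder for the OSD's data device, and I am assuming the bluestore_bdev_size field is present on this release):
# size BlueStore thinks its block device has, in bytes
ceph osd metadata 85 | grep -i size
# actual size of the underlying disk
lsblk /dev/sdX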
Kind regards,
Rok
Dan, Igor
It seems this wasn't backported? We get stored == used on a Luminous -> Nautilus 14.2.21 upgrade.
What is the solution? Find the OSDs that report zero bytes and drain/redeploy them?
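If this is the legacy BlueStore statfs accounting issue (an assumption on my part), an offline repair per OSD might avoid redeploying; a minimal sketch for a single OSD, with N as a placeholder for the OSD id:
# run with the OSD stopped
systemctl stop ceph-osd@N
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-N
systemctl start ceph-osd@N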
Thanks,
k
Sent from my iPhone
Hi Eugen, thank you very much for your reply. I'm Manuel, a colleague of Sebastián.
Let me complete the information you asked for.
We have checked more ceph commands; not only ceph crash and ceph orch but many other commands hang as well:
[spsrc-mon-1 ~]# cephadm shell -- ceph pg stat
hangs forever
[spsrc-mon-1 ~]# cephadm shell -- ceph status
Works
[spsrc-mon-1 ~]# cephadm shell -- ceph progress
hangs forever
[spsrc-mon-1 ~]# cephadm shell -- ceph balancer status
hangs forever
[spsrc-mon-1 ~]# cephadm shell -- ceph crash ls
hangs forever
[spsrc-mon-1 ~]# cephadm shell -- ceph crash stat
hangs forever
[spsrc-mon-1 ~]# cephadm shell -- ceph telemetry status
hangs forever
We have checked the call made from the container in the DEBUG logs and it looks correct; some commands work but others hang:
2021-05-20 09:56:02,903 DEBUG Running command (timeout=None): /bin/docker run --rm --ipc=host --net=host --privileged --group-add=disk -e CONTAINER_IMAGE=172.16.3.146:4000/ceph/ceph:v15.2.9 -e NODE_NAME=spsrc-mon-1 -v /var/run/ceph/3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c:/var/run/ceph:z -v /var/log/ceph/3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c:/var/log/ceph:z -v /var/lib/ceph/3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c/mon.spsrc-mon-1/config:/etc/ceph/ceph.conf:z -v /etc/ceph/ceph.client.admin.keyring:/etc/ceph/ceph.keyring:z --entrypoint ceph 172.16.3.146:4000/ceph/ceph:v15.2.9 pg stat
We have 3 monitor nodes and these are the containers that are running (on all monitor nodes):
acf8870fc788 172.16.3.146:4000/ceph/ceph:v15.2.9 "/usr/bin/ceph-mds -…" 7 days ago Up 7 days ceph-3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c-mds.manila.spsrc-mon-1.gpulzs
cfac86f29db4 172.16.3.146:4000/ceph/ceph:v15.2.9 "/usr/bin/ceph-mon -…" 7 days ago Up 7 days ceph-3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c-mon.spsrc-mon-1
4e6e600fa915 172.16.3.146:4000/ceph/ceph:v15.2.9 "/usr/bin/ceph-crash…" 7 days ago Up 7 days ceph-3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c-crash.spsrc-mon-1
dae36c48568e 172.16.3.146:4000/ceph/ceph:v15.2.9 "/usr/bin/ceph-mgr -…" 7 days ago Up 7 days ceph-3cdbf59a-a74b-11ea-93cc-f0d4e2e6643c-mgr.spsrc-mon-1.eziiam
All are in running status on all three monitor nodes. As you can see, on this monitor we have MDS, MON, CRASH and MGR daemons.
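For reference, most of the hanging commands above (crash, progress, balancer, telemetry) are served by mgr modules, so a mgr failover is a common first check (a sketch; the mgr daemon name is taken from the container list above):
cephadm shell -- ceph mgr fail spsrc-mon-1.eziiam
cephadm shell -- ceph mgr module ls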
Any ideas what we can check?
Best regards,
Manu
Hello,
I am contacting you on behalf of the Computer Science Club of the
University of Waterloo (https://csclub.uwaterloo.ca) to add our mirror
(https://mirror.csclub.uwaterloo.ca) as an official mirror of the Ceph
project. Our mirror is located at the
University of Waterloo in Waterloo, Ontario, Canada.
The official contact for the mirror is
systems-committee(a)csclub.uwaterloo.ca.
Following the instructions located at
https://ceph.io/get/, I have confirmed that our mirror meets the requirements:
* 1Gbit connection or more
Our mirror is connected to the University's network at 10 Gbps, but internet traffic is
limited to 1 Gbps on each of the University's three 10 Gbps internet links.
* Native IPv4 and IPv6
Our mirror is accessible over native IPv4 and IPv6.
mirror.csclub.uwaterloo.ca
--------------------------
129.97.134.71
2620:101:f000:4901:c5c::f:1055
* HTTP access
* rsync access
Our mirror offers HTTP, HTTPS, FTP and RSYNC access.
HTTP: http://mirror.csclub.uwaterloo.ca/ceph/
HTTPS: https://mirror.csclub.uwaterloo.ca/ceph/
FTP: ftp://mirror.csclub.uwaterloo.ca/ceph/
RSYNC: rsync://mirror.csclub.uwaterloo.ca/ceph/
* 2TB of storage or more
We have plenty of space to mirror ceph.
* Monitoring of the mirror/source
We have automated monitoring tracking mirror availability.
* Logs
Logs from HTTP/HTTPS and RSYNC requests are being stored.
We are currently configured to sync from the source repository every 3
hours. The initial sync of the repository has been completed.
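A sync job of this sort typically boils down to a cron entry along the following lines (a sketch only; the upstream endpoint rsync://download.ceph.com/ceph and the local path are assumptions for illustration):
# /etc/cron.d/ceph-mirror (sketch): sync every 3 hours as the "mirror" user
0 */3 * * * mirror rsync -a --delete rsync://download.ceph.com/ceph/ /srv/mirror/ceph/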
Please let me know if there is any additional information required
and/or any questions.
Thanks,
Zachary Seguin
Systems Committee
Computer Science Club | University of Waterloo
syscom(a)csclub.uwaterloo.ca
Hello,
I wanted to test CephFS subvolumes: how to mount one and how to set a quota.
After some "ceph fs" commands I got an e-mail from Prometheus that the cluster is in
"Health Warn". The error was that every MDS crashed with a segfault.
Some information about my cluster follows.
The cluster is running via podman.
ceph version 16.2.3 (381b476cb3900f9a92eb95d03b4850b953cfd79a) pacific (stable)
The commands I used to set up my subvolume:
ceph fs subvolumegroup create cephfs test-mount
# Size: 1073741824 bytes (1 GiB)
ceph fs subvolume create cephfs test-mount-volume test-mount --size=1073741824
ceph fs subvolume info cephfs test-mount-volume test-mount
# Failed
ceph fs subvolume authorize cephfs test-mount-volume test-mount-client test-mount /
# Success
ceph fs subvolume authorize cephfs test-mount-volume test-mount-client test-mount
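For reference, mounting a subvolume and setting a quota on it would typically look roughly like this (a sketch; the monitor address, client key, subvolume path and mount point are placeholders):
# resolve the subvolume's path inside the filesystem
ceph fs subvolume getpath cephfs test-mount-volume --group_name test-mount
# kernel mount of that path (placeholders for mon address, path and client key)
mount -t ceph <mon-ip>:6789:<subvolume-path> /mnt/test -o name=test-mount-client,secret=<key>
# set a 10 GiB quota on the mounted directory
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/test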
Now I got a Health Warn alert from Prometheus.
The reason is that the MDS is stuck in a "replay" loop; no MDS comes up anymore.
All MDS daemons crash with the following error message.
# MDS Error Message
replayed ESubtreeMap at 12471325829 subtree root 0x1 is not mine in cache (it's -2,-2)
*** Caught signal (Segmentation fault) **
The journal looks OK:
cephfs-journal-tool --rank=cephfs:all journal inspect
Overall journal integrity: OK
I took a backup of the journal via:
cephfs-journal-tool --rank=cephfs:all journal export backup.bin
and an export of the cephfs_metadata pool:
rados -p cephfs_metadata export cephfs_metadata_backup
A short output of the events:
cephfs-journal-tool --rank=cephfs:all event get list
https://pastebin.com/jUDTQL2U[1]
I would take the following actions to recover from the MDS failure:
cephfs-journal-tool --rank=cephfs:all event recover_dentries summary
cephfs-journal-tool --rank=cephfs:all journal reset
cephfs-table-tool --rank=cephfs:all reset session
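After such a journal reset one would typically restart the MDS daemons and check whether the filesystem comes back; a sketch (the orchestrator service name mds.cephfs is an assumption, adjust to your deployment):
# restart the MDS service and watch the filesystem state
ceph orch restart mds.cephfs
ceph fs status cephfs
ceph -s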
Any suggestions how to fix this?
Thanks
--
Carsten Feuls
--------
[1] https://pastebin.com/jUDTQL2U
I am not sure how to interpret CEPHADM_STRAY_HOST and
CEPHADM_STRAY_DAEMON warnings. They seem to be inconsistent.
I converted my cluster to be managed by cephadm by adopting
mon and all other daemons, and they show up in ceph orch ps,
but ceph health says mons are stray:
[WRN] CEPHADM_STRAY_HOST: 6 stray host(s) with 6 daemon(s) not managed by cephadm
    stray host ceph-6.icecube.wisc.edu has 1 stray daemons: ['mon.ceph-6']
...
At the same time, mon.ceph-6 is not mentioned in the
CEPHADM_STRAY_DAEMON section, which seems to contradict the
message about mon.ceph-6 being stray.
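For reference, stray warnings like this can come down to a mismatch between the hostname cephadm has registered and the name the daemon itself reports (ceph-6.icecube.wisc.edu vs. ceph-6 above); a sketch of the usual comparison:
ceph orch host ls            # hosts as cephadm knows them
ceph orch ps                 # daemons cephadm manages
ceph mon metadata ceph-6     # hostname the mon itself reports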
Any suggestions how to fix this?
Thanks,
Vlad
I recently configured Prometheus to scrape the mgr /metrics endpoint and added Grafana
dashboards. All daemons are at 15.2.11.
I use Hashicorp consul to advertise the active mgr in DNS, and Prometheus
points at a single DNS target. (Is anyone else using this method, or just
statically pointing Prometheus at all potentially active managers?)
All was working fine initially, and it's *mostly* still working fine. For
the first couple of days, all went well, and then a few rate metrics
stopped meaningfully increasing — essentially pegged at zero, which is
implausible in a healthy cluster. Some cluster maintenance was occurring
such as outing and recreating some OSDs, so I have a baseline for
throughput and recovery.
Metric graphs that stopped functioning:
Throughput: ceph_osd_op_r_out_bytes, ceph_osd_op_w_in_bytes,
ceph_osd_op_rw_in_bytes
Recovery: ceph_osd_recovery_ops
I can see that Grafana output is using this method of converting the
counters to rates:
sum(irate(ceph_osd_recovery_ops{job="$job"}[$interval]))
The underlying counters appear to be sane, and reading the raw values from
Prometheus is also valid, so I'm guessing a failure of either the irate
or sum function? By inspection in Grafana, the queries return correct
timestamps with zero values, so that leaves "sum(irate)" as the
likely source of the problem.
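For what it's worth, irate() looks only at the last two samples in the selected range, so it is sensitive to the scrape interval and to duplicate or stale samples (e.g. around mgr failover); comparing against a plain rate() over a fixed window, such as the query below, might help narrow it down (the 5m window is just an example):
sum(rate(ceph_osd_recovery_ops{job="$job"}[5m]))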
Does anyone have experience with this? I admit it is possibly tangential
to ceph itself, but as the Prometheus/grafana integration is more or less
supported, I thought I'd try here first.
--
Jeremy Austin
jhaustin(a)gmail.com
We just upgraded to Pacific, and I'm trying to clear warnings about legacy BlueStore omap usage stats by running 'ceph-bluestore-tool repair', as instructed by the warning message. It's been going fine, but we are now getting this error:
[root@vanilla bin]# ceph-bluestore-tool repair --path $osd_path
2021-05-19T19:25:26.485+0000 7f67ca3593c0 -1 bluestore(/var/lib/ceph/osd/ceph-9) fsck error: found stray omap data on omap_head 12256434 0 0
repair status: remaining 1 error(s) and warning(s)
[root@vanilla bin]# ceph-bluestore-tool fsck --path $osd_path -deep
2021-05-19T20:03:17.002+0000 7f4d1d6603c0 -1 bluestore(/var/lib/ceph/osd/ceph-9) fsck error: found stray omap data on omap_head 12256434 0 0
fsck status: remaining 1 error(s) and warning(s)
We're only 10% of the way through our OSDs, so I'd like to find some way to fix this other than destroying and rebuilding the OSD, in case it happens again. Fixing this error is especially attractive since we can't get out of HEALTH_WARN until we've run the repair on all OSDs.
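For reference, progress can be tracked from the health output (a sketch; the exact wording of the health check may differ by release):
ceph health detail | grep -i omap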
Any suggestions?
Neale Pickett <neale(a)lanl.gov>
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
Hi all,
I am still searching for orphan objects and came across a strange bug:
There is a huge multipart upload happening (around 4TB), and listing the
rados objects in the bucket loops over the multipart upload.
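For context, listing the rados objects of a bucket is typically done with radosgw-admin's radoslist (an assumption about the exact command used here; the bucket name is a placeholder):
radosgw-admin bucket radoslist --bucket=<bucket-name>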
--
The self-help group "UTF-8 Problems" will, as an exception, meet in the large hall this time.