Hi everyone!
I have a Ceph cluster running AlmaLinux 9, Podman, and Ceph Quincy (17.6.0).
Since yesterday I have been having some problems, and the latest one is that the
ceph orch command hangs. I have checked the logs but didn't find
anything relevant that could help fix the problem.
Podman shows the daemons running, and if I stop one daemon it reappears
after a few seconds.
I also tried the *ceph mgr fail* command and changing the daemon order, but ceph orch
is still not working. ceph orch pause/resume is not working either, and disabling
and re-enabling the *cephadm* module didn't help.
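For reference, this is roughly what I plan to check next (a sketch using standard cephadm debugging commands; nothing here is specific to my cluster):

# check module status and cluster health for errors
ceph mgr module ls
ceph health detail
# raise cephadm logging to debug and watch the cluster log
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm
# show recent cephadm log entries
ceph log last cephadm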
Any help to understand what's going on would be welcome.
Thanks in advance.
--
Dario Graña
PIC (Port d'Informació Científica)
Campus UAB, Edificio D
E-08193 Bellaterra, Barcelona
http://www.pic.es
Avis - Aviso - Legal Notice: http://legal.ifae.es
Dear Ceph users and developers,
we're struggling with a strange issue which I think might be a bug
causing snapshot data corruption while migrating an RBD image.
We've tracked it down to a minimal set of steps to reproduce, using a VM
with one 32G drive:
rbd create --size 32768 sata/D2
virsh create xml_orig.xml
rbd snap create ssd/D1@snap1
rbd export-diff ssd/D1@snap1 - | rbd import-diff - sata/D2
rbd export --export-format 1 --no-progress ssd/D1@snap1 - | xxh64sum
505dde3c49775773
rbd export --export-format 1 --no-progress sata/D2@snap1 - | xxh64sum
505dde3c49775773 # <- checksums match - OK
virsh shutdown VM
rbd migration prepare ssd/D1 sata/D1Z
virsh create xml_new.xml
rbd snap create sata/D1Z@snap2
rbd export-diff --from-snap snap1 sata/D1Z@snap2 - | rbd import-diff - sata/D2
rbd migration execute sata/D1Z
rbd migration commit sata/D1Z
rbd export --export-format 1 --no-progress sata/D1Z@snap2 - | xxh64sum
19892545c742c1de
rbd export --export-format 1 --no-progress sata/D2@snap2 - | xxh64sum
cc045975baf74ba8 # <- snapshots differ
The OS is AlmaLinux 9 based, kernel 5.15.105, Ceph 17.2.6, QEMU 8.0.3.
We tried disabling VM disk caches as well as discard, to no avail.
My first question: is it correct to assume that creating snapshots while
live-migrating data is safe? If so, any ideas on where the problem could be?
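If it helps, one way I can try to narrow it down further (a sketch; the /tmp paths are placeholders, and with a 32G image the exports need matching free space) is to export both snapshots, locate the first differing offset, and check it against the extents rbd diff reports:

rbd export --no-progress sata/D1Z@snap2 /tmp/d1z.img
rbd export --no-progress sata/D2@snap2 /tmp/d2.img
# print the first differing byte offsets
cmp -l /tmp/d1z.img /tmp/d2.img | head
# list extents changed between snap1 and snap2 on the migrated image
rbd diff --from-snap snap1 sata/D1Z@snap2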
If I can provide more info, please let me know.
with regards
nikola ciprich
--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava
tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis(a)linuxbox.cz
-------------------------------------
Hi all,
I see 17.2.7 Quincy is published as debian-bullseye packages, so I
tried it on a test cluster.
I must say I was not expecting the big dashboard change in a patch
release. Also, the "cluster utilization" numbers are all blank now
(any way to fix that?), so the dashboard is much less usable now.
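If I understand the new dashboard correctly (this is my assumption), the utilization graphs are now fed from Prometheus rather than from the mgr directly, so something like this might bring them back (host and port are placeholders):

ceph mgr module enable prometheus
ceph dashboard set-prometheus-api-host 'http://<prometheus-host>:9090'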
Thoughts?
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I am only able to see the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
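If the ops log is the way to go, this is roughly the configuration I would experiment with; I am not sure in which release the file-based ops log became available, so treat it as a sketch:

# ceph.conf snippet for the rgw instance (or the equivalent 'ceph config set' calls)
rgw_enable_ops_log = true
rgw_ops_log_rados = false
rgw_ops_log_file_path = /var/log/ceph/rgw-ops.log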
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb(a)kervyn.de> wrote:
> Hi,
> I am looking forward to move our logs from
> /var/log/ceph/ceph-client...log to our logaggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
The "UTF-8 problems" self-help group will, as an exception, meet in the
big hall this time.
Hello,
I have 7 machines in a Ceph cluster; the Ceph services run in Docker
containers.
Each machine has 4 data HDDs (available) and 2 NVMe SSDs (bricked).
During a reboot, the SSDs bricked on 4 machines. The data are available on
the HDDs, but the NVMe drives are bricked and the system is not available. Is it
possible to recover the data of the cluster (the data disks are all
available)?
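In the meantime I can inventory what is still readable on the surviving disks (a sketch, assuming the OSDs were deployed by ceph-volume; run on each machine):

# list OSD metadata on LVM-based data disks
ceph-volume lvm list
# or, if the deployment used raw devices
ceph-volume raw list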
Hi
I wrote before about issues I was having with cephadm in 18.2.0. Sorry, I didn't see the helpful replies because my mail service binned the responses.
I still can't get the reef version of cephadm to work properly.
I had updated the system RPM to Reef (Ceph repo) and also upgraded the containerised Ceph daemons to Reef before my first email.
Both the system-package cephadm and the one found at /var/lib/ceph/${fsid}/cephadm.* return the same error when running "cephadm version":
Traceback (most recent call last):
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9468, in <module>
    main()
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 9456, in main
    r = ctx.func(ctx)
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2108, in _infer_image
    ctx.image = infer_local_ceph_image(ctx, ctx.container_engine.path)
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2191, in infer_local_ceph_image
    container_info = get_container_info(ctx, daemon, daemon_name is not None)
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in get_container_info
    matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid]
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 2154, in <listcomp>
    matching_daemons = [d for d in daemons if daemon_name_or_type(d) == daemon_filter and d['fsid'] == ctx.fsid]
  File "./cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e", line 217, in __getattr__
    return super().__getattribute__(name)
AttributeError: 'CephadmContext' object has no attribute 'fsid'
I am running into other issues as well, but I think they may point back to this issue of "'CephadmContext' object has no attribute 'fsid'"
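Since the traceback dies while inferring the local ceph container image, I am guessing (unconfirmed) that naming the image explicitly might bypass the failing code path:

cephadm --image quay.io/ceph/ceph:v18.2.0 version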
Any help would be appreciated.
Regards,
Martin Conway
IT and Digital Media Manager
Research School of Physics
Australian National University
Canberra ACT 2601
+61 2 6125 1599
https://physics.anu.edu.au
Hi everyone,
We have recently noticed that the diskprediction_local module only works for a
limited set of manufacturers. Hence we have the following questions:
Are there any plans to support more manufacturers in the near future?
Can we contribute to the process of training new models, and how?
Can the existing models be used with disks from other vendors (with
hackish methods)? Is it that it just doesn't work, or would it not be
reliable? Basically, how does the relation between a trained model and a
vendor work?
Say we have a disk for which a trained model exists: what is the best way to test
the trained model and the module itself?
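For context, this is roughly how we exercise the module on a disk today (standard ceph device commands; <devid> is a placeholder taken from the ceph device ls output):

ceph mgr module enable diskprediction_local
ceph config set global device_failure_prediction_mode local
ceph device ls
ceph device scrape-health-metrics <devid>
ceph device predict-life-expectancy <devid>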
Thanks in advance
Hi,
I started a CephFS forward scrub by executing the command
# ceph tell mds.cephfs:0 scrub start / recursive
{
"return_code": 0,
"scrub_tag": "37a67f72-89a3-474e-8f8b-1e55cb979e14",
"mode": "asynchronous"
}
But immediately after it started, memory usage on the MDS holding rank 0
increased from 2 GB to 32 GB, until the MDS was killed by the OOM killer.
Why does it consume all that memory? Can it be adjusted somehow?
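In the meantime, these are the knobs I plan to try (pausing/aborting is documented scrub control; whether the ops throttle actually bounds MDS memory is my assumption):

# pause or abort the running scrub on rank 0
ceph tell mds.cephfs:0 scrub pause
ceph tell mds.cephfs:0 scrub abort
# throttle the number of concurrent scrub operations
ceph config set mds mds_max_scrub_ops_in_progress 2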
Thank you