Hi,
With Ceph 15.2.5 Octopus, the mon, mgr and rgw daemons dump debug-level
logging to stdout/stderr. This produces a huge container log file
(/var/lib/docker/containers/<ID>/<ID>-json.log).
Is there any way to stop dumping logs or change the logging level?
BTW, I tried "ceph config set <service> log_to_stderr false".
It doesn't help.
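For completeness, these are the other knobs I am considering, based on my
reading of the config reference (the docker daemon.json part is just my
container setup, so please correct me if any of this is off):

# stop the cluster log and daemon errors from going to stderr as well
ceph config set global mon_cluster_log_to_stderr false
ceph config set global err_to_stderr false

# lower the debug level of a noisy subsystem, e.g. the mgr
ceph config set mgr debug_mgr 1/5

# and cap the json-file driver in /etc/docker/daemon.json as a fallback
{ "log-driver": "json-file", "log-opts": { "max-size": "10m", "max-file": "3" } }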
Thanks!
Tony
Hi,
The docs have scant detail on doing a migration to bluestore using a
per-osd device copy:
https://docs.ceph.com/en/latest/rados/operations/bluestore-migration/#per-o…
This mentions "using the copy function of ceph-objectstore-tool", but
ceph-objectstore-tool doesn't have a copy function (all the way from v9 to
current).
Has anyone actually tried doing this?
Is there any further detail available on what is involved, e.g. a broad
outline of the steps?
Of course, detailed instructions would be even better, even if accompanied
by "here be dragons!" warnings.
Cheers,
Chris
Hello everyone,
I enabled the rgw ops log by setting "rgw_enable_ops_log = true". There
is a "total_time" field in the rgw ops log, but I want to figure out
whether "total_time" includes the time rgw spends returning the response
to the client.
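For context, this is roughly how I have it set up (the client section
name and socket path are just from my own config, so treat them as
examples):

[client.rgw.gw1]
rgw_enable_ops_log = true
rgw_ops_log_socket_path = /var/run/ceph/rgw-ops.sock

# I read the entries (one JSON record per request) straight off the socket:
nc -U /var/run/ceph/rgw-ops.sock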
Hello,
hope you had a nice Xmas and I wish all of you a good and happy new year
in advance...
Yesterday my Ceph Nautilus 14.2.15 cluster had a disk with unreadable
sectors. After several retries the OSD was marked down, and rebalancing
started and has since finished successfully. "ceph osd stat" now shows
the OSD as "autoout,exists".
Usually the steps to replace a failed disk are (spelled out with example
values below):
1. Destroy the failed OSD: ceph osd destroy {id}
2. Run ceph-volume lvm create --bluestore --osd-id {id} --data /dev/sdX
with the new disk in place, to recreate an OSD with the same id without
the need to change the crushmap or auth info etc.
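Spelled out with example values (osd id 12 and /dev/sdk are placeholders
for my setup):

ceph osd destroy 12 --yes-i-really-mean-it      # keeps the id and cephx key reusable
# ... physically replace the disk ...
ceph-volume lvm create --bluestore --osd-id 12 --data /dev/sdk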
Now I am still waiting for the new disk, and I am unsure whether I
should run the destroy command already, to keep Ceph from trying to
reactivate the broken OSD, and then wait until the disk arrives in a day
or so before using ceph-volume to create the new OSD.
Or should I leave the state as it is until the disk has arrived and then
run both steps (destroy, ceph-volume lvm create) one right after the
other?
Do the two slightly different approaches make any difference if, for
example, a power failure caused a reboot of the node with the failed OSD
before I could replace the broken disk?
Any comments on this?
Thanks
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312
Hi Eugen,
Indeed some really useful tips explaining what goes wrong, yet this
thread [1] is about cephfs mounted directly on the osd node. I also ran
that setup for quite some time without any problems until I suddenly hit
the same issue they had. I think I did not have any issues with the
kernel-client cephfs mount on Luminous until I enabled cephfs snapshots;
then I had to switch to the fuse client.
In my case I am running a vm on the osd node, which I thought would be
different. I have been able to reproduce this stale mount just 2 times
now; I have been testing with 10x more clients and it still works.
Anyway, I decided to move everything to rbd. I have been running vm's
with rbd images colocated on osd nodes without problems for quite some
time. I really would like to use these hosts because they each have
16c/32t and an average load of just 2-3.
Unfortunately I did not document precisely how I recovered from the
stale mount; I would like to see if I can reduce the number of steps
involved. Things only started happening for me after I did the mds
failover. Then I got blocked clients that I could unblock, and I could
fix the mount with "mount -l" (roughly the commands below).
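From memory, the recovery came down to roughly the following (the client
address and mountpoint are placeholders, and I am not sure yet this is
the minimal sequence):

ceph osd blacklist ls                              # look for the stale cephfs client
ceph osd blacklist rm 192.168.10.43:0/3418671542   # unblock it

# on the client, lazy-unmount and remount (this is the part I am least sure about)
umount -l /mnt/cephfs && mount /mnt/cephfs

# and before that, the mds failover that got things moving
ceph mds fail 0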
Thanks for the pointers, I have linked them in my docs ;)
-----Original Message-----
To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: kvm vm cephfs mount hangs on osd node
(something like umount -l available?) (help wanted going to production)
Hi,
there have been several threads about hanging cephfs mounts; one quite
long thread [1] describes a couple of debugging options but also
recommends avoiding cephfs mounts on OSD nodes in a production
environment.
Do you see blacklisted clients with 'ceph osd blacklist ls'? If the
answer is yes, try to unblock that client [2].
The same option ('umount -l') is available on a cephfs client, so you can
try that, too. Another option described in [1] is to perform an MDS
failover, but sometimes a reboot of that VM is the only solution left.
Regards,
Eugen
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028719.html
[2]
https://docs.ceph.com/en/latest/cephfs/eviction/#advanced-un-blocklisting-a…
Quoting Marc Roos <M.Roos(a)f1-outsourcing.eu>:
> Is there not some genius out there who can shed some light on this? ;)
> Currently I am not able to reproduce this, so it would be nice to
> have some procedure at hand that resolves stale cephfs mounts nicely.
>
>
> -----Original Message-----
> To: ceph-users
> Subject: [ceph-users] kvm vm cephfs mount hangs on osd node (something
> like umount -l available?) (help wanted going to production)
>
>
>
> I have a vm on an osd node (which can reach the host and other nodes via
> the macvtap interface used by both the host and guest). I just did a
> simple bonnie++ test and everything seems to be fine. Yesterday, however,
> the dovecot process apparently caused problems (I am only using cephfs
> for an archive namespace; the inbox is on rbd ssd, and the fs metadata is
> also on ssd).
>
> How can I recover from such a lock-up? If I have a similar situation
> with an nfs-ganesha mount, I have the option to do a umount -l, and
> clients recover quickly without any issues.
>
> Having to reset the vm is not really an option. What is the best way to
> resolve this?
>
>
>
> Ceph cluster: 14.2.11 (the vm has 14.2.16)
>
> I have nothing special in my ceph.conf, just these entries in the mds section:
>
> mds bal fragment size max = 120000
> # maybe for nfs-ganesha problems?
> # http://docs.ceph.com/docs/master/cephfs/eviction/
> #mds_session_blacklist_on_timeout = false
> #mds_session_blacklist_on_evict = false
> mds_cache_memory_limit = 17179860387
>
>
> All running:
> CentOS Linux release 7.9.2009 (Core)
> Linux mail04 3.10.0-1160.6.1.el7.x86_64 #1 SMP Tue Nov 17 13:59:11 UTC
> 2020 x86_64 x86_64 x86_64 GNU/Linux
What is the easiest and best way to migrate a bucket from an old cluster to a new one?
It would be Luminous to Octopus; I am not sure whether that matters from the data perspective.
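One approach I am considering (not sure whether it is the best one) is to
copy at the S3 level with rclone, so the cluster versions should not
matter; the remote names and bucket below are made up:

# ~/.config/rclone/rclone.conf defines two S3 remotes, "oldceph" and "newceph",
# pointing at the old and new RGW endpoints with their respective access keys
rclone sync oldceph:mybucket newceph:mybucket --progress

Users, ACLs and bucket policies would still need to be recreated on the
new cluster separately, as far as I understand.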
Hi,
We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the servers ran out of memory for unknown reasons (normally the machine uses about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an endless restart loop; the logs just mention the system seeing the daemon fail and then restarting it. Since the out-of-memory incident we have had 3 OSDs fail this way, at separate times. We resorted to wiping each affected OSD and re-adding it to the cluster, but it seems that as soon as all PGs have moved back onto the OSD, the next one fails.
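For reference, this is roughly how we have been pulling information from
one of the flapping OSDs so far (the osd id is an example and the fsid is
elided):

# journal of the containerized daemon
journalctl -u ceph-<fsid>@osd.12 --since "1 hour ago"

# or via cephadm
cephadm logs --name osd.12

# temporarily raise the OSD's own debug level
ceph config set osd.12 debug_osd 10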
This is also keeping us from re-deploying RGW, which was affected by the same out of memory incident, since cephadm runs a check and won’t deploy the service unless the cluster is in HEALTH_OK status.
Any help would be greatly appreciated.
Thanks,
Stefan
Hello Ceph Users,
Since upgrading from Nautilus to Octopus (the cluster started on
Luminous) I have been trying to debug why the RocksDB/WAL is maxing out
the SSD drives (QD > 32, 12000 read IOPS, 200 write IOPS).
The omap upgrade on migration was disabled initially, but I re-enabled it
and restarted all OSDs; this completed without issue.
I have increased the memory target from 4 to 6 GB per OSD, but it doesn't
look like it is using it all anyway (based on top).
I have offline-compacted all OSDs. This seems to help for about 4-6 hours
(backfilling is occurring - maybe this triggers it?).
RGW garbage collection is up to date.
The pg_log on some PGs is high because they are not yet in a clean state
(8% of PGs > 3000 entries); the remaining PGs I have reduced to 500 log
entries - no change.
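For completeness, these are roughly the commands I have been using for
the above (the values and osd id are just what I picked, not
recommendations):

# per-OSD memory target raised to 6 GiB
ceph config set osd osd_memory_target 6442450944

# offline RocksDB compaction, with the OSD stopped
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-42 compact

# pg_log trimming
ceph config set osd osd_max_pg_log_entries 500
ceph config set osd osd_min_pg_log_entries 500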
I've been working on this issue for days now without much luck. Nothing
in the logs indicates a major issue.
The client impact is a major reduction in speed.
{
    "mon": {
        "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 1
    },
    "osd": {
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 18,
        "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 280
    },
    "mds": {},
    "rgw": {
        "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 2
    },
    "tcmu-runner": {
        "ceph version 14.2.13-450-g65ea1b614d (65ea1b614db8b6d10f334a8ff67c4de97f73bcbf) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.13-450-g65ea1b614d (65ea1b614db8b6d10f334a8ff67c4de97f73bcbf) nautilus (stable)": 2,
        "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 18,
        "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 288
    }
}
Any assistance in debugging would be greatly helpful.
Glen
I just went to set up an iscsi gateway on a Debian Buster / Octopus
cluster and hit a brick wall with packages. I had perhaps naively
assumed they were in with the rest. Now I understand that it can exist
separately, but then so can RGW.
I found some ceph-iscsi rpm builds for Centos, but nothing for Debian.
Are they around somewhere? The prerequisite packages
python-rtslib-2.1.fb68 and tcmu-runner-1.4.0 also don't seem to be
readily available for Debian.
Has anyone done this for Debian?
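In case it comes down to building from source, this is what I am
considering trying (purely a sketch based on the upstream repos; I have
not verified it on Buster, and the build dependency list is a guess):

apt install cmake libnl-3-dev libnl-genl-3-dev zlib1g-dev libglib2.0-dev

git clone https://github.com/open-iscsi/tcmu-runner && cd tcmu-runner
cmake -Dwith-glfs=false -Dwith-qcow=false . && make && make install

pip3 install rtslib-fb
git clone https://github.com/ceph/ceph-iscsi && cd ceph-iscsi && python3 setup.py install

If anyone has proper Debian packages (or knows of a repo), that would
obviously be preferable.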
Thanks, Chris
Dear All,
Hope you all had a great Christmas and much needed time off with family!
Have any of you used "device management and failure prediction" in
Nautilus? If yes, what is your feedback? Do you use LOCAL or CLOUD
prediction models?
https://ceph.io/update/new-in-nautilus-device-management-and-failure-predic…
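For context, my understanding is that turning it on amounts to something
like the following (the device id is a placeholder; please correct me if
this is off):

ceph device monitoring on
ceph config set global device_failure_prediction_mode local   # or "cloud"
ceph device ls
ceph device get-health-metrics <devid>
ceph device predict-life-expectancy <devid>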
Your feedback and input is valuable.
--
Regards,
Suresh