I have a seemingly strange situation. I have three OSDs that I created with Ceph Octopus using the `ceph orch daemon add <host>:device` command. All three were added and everything was great. Then I rebooted the host. Now the daemon’s won’t start via Docker. When I attempt to run the `docker` command directly it errors with:
root@balin:/var/lib/ceph/c3d06c94-bb66-4f84-bf78-470a2364b667/osd.12# /usr/bin/docker run --rm --net=host --privileged --group-add=disk --name ceph-c3d06c94-bb66-4f84-bf78-470a2364b667-osd.12 -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15 -e NODE_NAME=balin -v /var/run/ceph/c3d06c94-bb66-4f84-bf78-470a2364b667:/var/run/ceph:z -v /var/log/ceph/c3d06c94-bb66-4f84-bf78-470a2364b667:/var/log/ceph:z -v /var/lib/ceph/c3d06c94-bb66-4f84-bf78-470a2364b667/crash:/var/lib/ceph/crash:z -v /var/lib/ceph/c3d06c94-bb66-4f84-bf78-470a2364b667/osd.12:/var/lib/ceph/osd/ceph-12:z -v /var/lib/ceph/c3d06c94-bb66-4f84-bf78-470a2364b667/osd.12/config:/etc/ceph/ceph.conf:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm --entrypoint /usr/bin/ceph-osd docker.io/ceph/ceph:v15 -n osd.12 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix="debug "
debug 2020-05-07T22:58:06.258+0000 7f622a161ec0 0 set uid:gid to 167:167 (ceph:ceph)
debug 2020-05-07T22:58:06.258+0000 7f622a161ec0 0 ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable), process ceph-osd, pid 1
debug 2020-05-07T22:58:06.258+0000 7f622a161ec0 0 pidfile_write: ignore empty --pid-file
debug 2020-05-07T22:58:06.258+0000 7f622a161ec0 -1 bluestore(/var/lib/ceph/osd/ceph-12/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-12/block: (13) Permission denied
debug 2020-05-07T22:58:06.258+0000 7f622a161ec0 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-12: (2) No such file or directory
The OSDs are able to come back online if I run `ceph-volume lvm activate —all`. Everything from a usage point of view is fine, even after a reboot, however I now have errors in the `ceph orch ps` list:
osd.12 balin error 27s ago - <unknown> docker.io/ceph/ceph:v15 <unknown> <unknown>
This is an Ubuntu 20.04 system, FWIW. I haven’t a clue where to go from here. While things are technically working since the OSDs are online and functioning, I’d really like to have them under the `ceph orch` management like the rest of the systems.
~Sean
Hi,
===
NOTE: I do not see my thread in ceph-list for some reason. I don't know if list received my question or not. So, sorry if this is duplicate.
===
I just deployed a new cluster with cephadm instead of ceph-deploy. In tyhe past, If i change ceph.conf for tweaking, i was able to copy them and apply to all servers. But i cannot find this on new cephadm tool. I did few changes on ceph.conf but ceph is unaware of those changes. How can i apply them? I've used it with docker. Thanks, Gencer.
Hello,
Sorry if this has been asked before...
A few months ago I deployed a small Nautilus cluster using
ceph-ansible. The OSD nodes have multiple spinning drives and a PCI
NVMe. Now that the cluster has been stable for a while it's time to
start optimizing performance.
While I can tell that there is a part of the NVMe associated with each
OSD, I'm trying to verify which BlueStore components are using the NVMe
- WAL, DB, Cache - and whether the configuration generated by
ceph-ansible (and my settings in osds.yml) is optimal for my hardware.
I've searched around a bit and, while I have found documentation on how
to configure, reconfigure, and repair a BlueStore OSD, I haven't found
anything on how to query the current configuration.
Could anybody point me to a command or link to documentation on this?
Thanks.
-Dave
--
Dave Hall
Binghamton University
Quick question Ceph guru's.
For a 1.1PB raw cephfs system currently storing 191TB of data and 390 million objects (mostly small Python, ML training files etc.) how many MDS servers should I be running?
System is Nautilus 14.2.8.
I ask because up to know I have run one MDS with one standby-replay and occasionally it blows up with large memory consumption, 60Gb+ even though I have mds_cache_memory_limit = 32G and that was 16G until recently. It of course tries to restart on another MDS node fails again and after several attempts usually comes back up. Today I increased to two active MDS's but the question is what is the optimal number for a pretty active system? The single MDS seemed to regularly run around 1400 req/s and I often get up to six clients failing to respond to cache pressure.
The current setup is:
ceph fs status
cephfs - 71 clients
======
+------+----------------+--------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+----------------+--------+---------------+-------+-------+
| 0 | active | a | Reqs: 447 /s | 12.0M | 11.9M |
| 1 | active | b | Reqs: 154 /s | 1749k | 1686k |
| 1-s | standby-replay | c | Evts: 136 /s | 1440k | 1423k |
| 0-s | standby-replay | d | Evts: 402 /s | 16.8k | 298 |
+------+----------------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
| Pool | type | used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 160G | 169G |
| cephfs_data | data | 574T | 140T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| w |
| x |
| y |
| z |
+-------------+
MDS version: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Regards.
Robert Ruge
Systems & Network Manager
Faculty of Science, Engineering & Built Environment
[cid:image001.png@01D36789.04BE09A0]
Important Notice: The contents of this email are intended solely for the named addressee and are confidential; any unauthorised use, reproduction or storage of the contents is expressly prohibited. If you have received this email in error, please delete it and any attachments immediately and advise the sender by return email or telephone.
Deakin University does not warrant that this email and any attachments are error or virus free.
This is the second time this happened in a couple of weeks. The MDS locks
up and the stand-by can't take over so the Montiors black list them. I try
to unblack list them, but they still say this in the logs
mds.0.1184394 waiting for osdmap 234947 (which blacklists prior instance)
Looking at a pg dump, it looks like the epoch is passed that.
$ ceph pg map 3.756
osdmap e234953 pg 3.756 (3.756) -> up [113,180,115] acting [113,180,115]
Last time, it seemed to just recover after about an hour all by it's self.
Any way to speed this up?
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
In the Nautilus manual it recommends >= 4.14 kernel for multiple active
MDSes. What are the potential issues for running the 4.4 kernel with
multiple MDSes? We are in the process of upgrading the clients, but at
times overrun the capacity of a single MDS server.
MULTIPLE ACTIVE METADATA SERVERS
<https://docs.ceph.com/docs/nautilus/cephfs/kernel-features/#multiple-active…>
The feature has been supported since the Luminous release. It is
recommended to use Linux kernel clients >= 4.14 when there are multiple
active MDS.
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Hi
I have a few questions about bucket versioning.
in the output of command "*radosgw-admin bucket stats --bucket=XXX"* there
is info about versions:
"ver": "0#521391,1#516042,2#518098,3#517681,4#518423",
"master_ver": "0#0,1#0,2#0,3#0,4#0",
Also "*metadata get"* returns info about versions:
radosgw-admin metadata get bucket:XXX
{
"key": "bucket:XXX",
"ver": {
"tag": "_KrvQc6gBg1Zcrr8s8M5jXmk",
"ver": 335
},
But I'm pretty sure that bucket versioning should be disabled, because "*aws
s3api get-bucket-versioning*" returns nothing.
How should I understand the current situation?
The problem is that from the client side I can see that the bucket is very
small. Less than 10GB while checking the bucket stats from radosgw-admins
side shows the bucket is taking nearly 1TB.
Kind regards / Pozdrawiam,
Katarzyna Myrek
Hi Manuel,
My replica is 2, hence about 10TB of unaccounted usage.
Andrei
----- Original Message -----
> From: "EDH - Manuel Rios" <mriosfer(a)easydatahost.com>
> To: "Andrei Mikhailovsky" <andrei(a)arhont.com>
> Sent: Tuesday, 28 April, 2020 23:57:20
> Subject: RE: rados buckets copy
> Is your replica x3? 9x3 27... plus some overhead rounded....
>
> Ceph df show including replicas , bucket stats just bucket usage no replicas.
>
> -----Mensaje original-----
> De: Andrei Mikhailovsky <andrei(a)arhont.com>
> Enviado el: miércoles, 29 de abril de 2020 0:55
> Para: ceph-users <ceph-users(a)ceph.io>
> Asunto: [ceph-users] rados buckets copy
>
> Hello,
>
> I have a problem with radosgw service where the actual disk usage (ceph df shows
> 28TB usage) is way more than reported by the radosgw-admin bucket stats (9TB
> usage). I have tried to get to the end of the problem, but no one seems to be
> able to help. As a last resort I will attempt to copy the buckets, rename them
> and remove the old buckets.
>
> What is the best way of doing this (probably on a high level) so that the copy
> process doesn't carry on the wasted space to the new buckets?
>
> Cheers
>
> Andrei
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to
> ceph-users-leave(a)ceph.io
Dear all,
Two days ago I added very few disks to a ceph cluster and run into a problem I have never seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I added OSDs under 13.2.8.
I had a few hosts that I needed to add 1 or 2 OSDs to and I started with one that needed 1. Procedure was as usual:
ceph osd set norebalance
deploy additional OSD
The OSD came up and PGs started peering, so far so good. To my surprise, however, I started seeing health-warnings about slow ping times:
Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec
After peering it looked like it got better and I waited it out until the messages were gone. This took a really long time, at least 5-10 minutes.
I went on to the next host and deployed 2 new OSDs this time. Same as above, but with much worse consequences. Apparently, the ping times exceeded a timeout for a very short moment and an OSD was marked out for ca. 2 seconds. Now all hell broke loose. I got health errors with the dreaded "backfill_toofull", undersized PGs and a large amount of degraded objects. I don't know what is causing what, but I ended up with data loss by just adding 2 disks.
We have dedicated network hardware and each of the OSD hosts has 20GBit front and 40GBit back network capacity (LACP trunking). There are currently no more than 16 disks per server. The disks were added to an SSD pool. There was no traffic nor any other exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot see a single curve going up. Network, CPU utilisation, load, everything below measurement accuracy. The hosts and network are quite overpowered and dimensioned to host many more OSDs (in future expansions).
I have three questions, ordered by how urgently I need an answer:
1) I need to add more disks next week and need a workaround. Will something like this help avoiding the heartbeat time-out:
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
2) The "lost" shards of the degraded objects were obviously still on the cluster somewhere. Is there any way to force the cluster to rescan OSDs for the shards that went orphan during the incident?
3) This smells a bit like a bug that requires attention. I was probably just lucky that I only lost 1 shard per PG. Has something similar reported before? Is this fixed in 13.2.10? Is it something new? Any settings that need to be looked at? If logs need to be collected, I can do so during my next attempt. However, I cannot risk data integrity of a production cluster and, therefore, probably not run the original procedure again.
Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I saw there was a clone_range function in librados earlier. But it removed in version 12 I beleive. I need excatly that function to avoid unnecessary network traffic.
I need to combine many small objects into one. So clone range would be really useful for me. I can read from an object and write to another but this will cause unnecessary network traffic.
How can I do this in new versions of librados ?