Hi,
We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4.
Our cephfs clients use the kernel module, and we have noticed that some of
them occasionally hang after an MDS restart (we have seen this at least once).
The only way to resolve this is to unmount and remount the mountpoint, or to
reboot the machine if unmounting is not possible.
After some investigation, the problem seems to be that the MDS denies
reconnect attempts from some clients during restart even though the
reconnect interval is not yet reached. In particular, I see the following
log entries. Note that the MDS reports 9 sessions. Nine clients reconnect
(one client has two mountpoints), and then two more clients attempt to
reconnect after the MDS has already logged "reconnect_done". These two
clients were hanging after the event; the kernel log of one of them is shown
below as well.
Running `ceph tell mds.0 client ls` after the clients have been
rebooted/remounted also shows 11 clients instead of 9.
Do you have any idea what is wrong here and how it could be fixed? My guess
is that the MDS has an incorrect session count and therefore stops the
reconnect process too soon. Is this indeed a bug, and if so, do you know
what is broken?
Regardless, I also think that the kernel client should be able to deal with a
denied reconnect and try again later. Yet, even after 10 minutes, the kernel
does not attempt to reconnect. Is this a known issue, or maybe fixed in newer
kernels? If not, is there a chance of getting this fixed?
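In case it helps, this is roughly how I have been inspecting the sessions, plus the workaround I'm considering; the eviction step is only a guess on my side, so corrections are welcome:
```
# list the sessions the MDS currently tracks (this is where I see 11 instead of 9)
ceph tell mds.0 session ls

# the reconnect window the denial message refers to ("allowed interval 45")
ceph config get mds mds_reconnect_timeout

# workaround I'm considering: evict the stale session by its client id
# (id taken from "session ls"), then remount on the affected machine
ceph tell mds.0 client evict id=24167394
```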
Thanks,
Florian
MDS log:
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0
> 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby
> 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
> 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45)
> 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby
Hanging client (10.1.67.49) kernel log:
> 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start
> 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied
> 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING)
> 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session
> 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm
Hi Dave and everyone else affected,
I'm responding to a thread you opened on an issue with lvm OSD creation:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YYH3VANVV22…
https://tracker.ceph.com/issues/43868
Most important question: is there a workaround?
My observations: I'm running into the exact same issue on mimic 13.2.10. The strange thing is that some OSDs get created and others fail; I can't see a pattern. I have one host where every create worked and another where half of them failed. The important lines in the log are probably:
stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 bluestore(/var/lib/ceph/osd/ceph-342/) _read_fsid unparsable uuid
stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 bdev(0x561db199c700 /var/lib/ceph/osd/ceph-342//block) _aio_start io_setup(2) failed with EAGAIN; try increasing /proc/sys/fs/aio-max-nr
stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 bluestore(/var/lib/ceph/osd/ceph-342/) mkfs failed, (11) Resource temporarily unavailable
stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 OSD::mkfs: ObjectStore::mkfs failed with error (11) Resource temporarily unavailable
stderr: 2021-02-06 13:48:27.477 7f46756b4b80 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-342/: (11) Resource temporarily unavailable
I really need to get a decent number of disks up very soon. Any help is appreciated. I can provide more output if that helps.
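The only idea I have so far is to raise the limit the error message points at; I haven't tried it yet, so please treat this as a sketch rather than a confirmed workaround:
```
# how close we are to the limit (aio contexts in use vs. maximum)
cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr

# raise the limit for the running system (the value is an arbitrary example)
sysctl -w fs.aio-max-nr=1048576

# make it persistent across reboots
echo 'fs.aio-max-nr = 1048576' > /etc/sysctl.d/90-aio-max-nr.conf
```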
Best regards and good weekend!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
Imagine this situation:
- 3 servers with ceph
- a pool with size 2 min 1
I know perfectly well that size 3 and min 2 is better.
I would like to know what is the worst thing that can happen:
- a disk breaks, and another disk breaks before ceph has reconstructed the
second replica: OK, I lose data
- if the network goes down and the monitors lose quorum, does ceph still
write to the disks?
What else?
Thanks,
Mario
Hi all,
My monitor nodes keep going up and down because of paxos lease timeouts, and
there is high iops (2k iops) and 500 MB/s of throughput on
/var/lib/ceph/mon/ceph.../store.db/.
My cluster is in a recovery state and there are a bunch of degraded pgs.
It seems to be doing 200k-block-size io on rocksdb. Is that okay?!
Also, is there any solution to fix these monitor downtimes?
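In case it is relevant, the only mitigation I have come up with so far is to compact the mon stores one monitor at a time; I have not tried it yet, so this is just a sketch of what I have in mind:
```
# size of the mon store on each monitor host
du -sh /var/lib/ceph/mon/*/store.db

# ask a single monitor to compact its rocksdb store (<id> is the mon name)
ceph tell mon.<id> compact
```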
Thanks for your help!
Hi all,
I'm experimenting with ceph-volume on CentOS 7, ceph mimic 13.2.10. When I execute "ceph-volume deactivate ..." on a previously activated OSD, I get this error:
# ceph-volume lvm deactivate 12 0bbf481c-6a3d-4724-9a27-3a845eb05911
stderr: /usr/bin/findmnt: invalid option -- 'M'
stderr: Usage:
The help summary of findmnt is at the bottom. I could hack a script together that forwards this call to findmnt and translates the option. Can anyone tell me if this is a bug or if I'm running the wrong version?
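If it comes to that hack, something like the wrapper below is what I have in mind. This assumes that the "-M" ceph-volume passes is the newer "--mountpoint" option, that "-T/--target" is an acceptable substitute on this findmnt version, and that ceph-volume resolves findmnt via PATH; I would appreciate confirmation before relying on it:
```
#!/bin/bash
# /usr/local/bin/findmnt -- shim placed ahead of /usr/bin/findmnt in PATH.
# Translates the unsupported '-M'/'--mountpoint' option to '-T' and forwards
# everything else unchanged to the real findmnt.
args=()
for a in "$@"; do
    if [ "$a" = "-M" ] || [ "$a" = "--mountpoint" ]; then
        args+=("-T")
    else
        args+=("$a")
    fi
done
exec /usr/bin/findmnt "${args[@]}"
```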
In the meantime, is ceph-volume deactivate just unmounting the /var/lib/ceph/osd/ceph-ID directory, or is there more to it (no dmcrypt in use)?
Thanks!
# findmnt --help
Usage:
findmnt [options]
findmnt [options] <device> | <mountpoint>
findmnt [options] <device> <mountpoint>
findmnt [options] [--source <device>] [--target <mountpoint>]
Options:
-s, --fstab search in static table of filesystems
-m, --mtab search in table of mounted filesystems
-k, --kernel search in kernel table of mounted
filesystems (default)
-p, --poll[=<list>] monitor changes in table of mounted filesystems
-w, --timeout <num> upper limit in milliseconds that --poll will block
-A, --all disable all built-in filters, print all filesystems
-a, --ascii use ASCII chars for tree formatting
-c, --canonicalize canonicalize printed paths
-D, --df imitate the output of df(1)
-d, --direction <word> direction of search, 'forward' or 'backward'
-e, --evaluate convert tags (LABEL,UUID,PARTUUID,PARTLABEL)
to device names
-F, --tab-file <path> alternative file for --fstab, --mtab or --kernel options
-f, --first-only print the first found filesystem only
-i, --invert invert the sense of matching
-l, --list use list format output
-N, --task <tid> use alternative namespace (/proc/<tid>/mountinfo file)
-n, --noheadings don't print column headings
-u, --notruncate don't truncate text in columns
-O, --options <list> limit the set of filesystems by mount options
-o, --output <list> the output columns to be shown
-P, --pairs use key="value" output format
-r, --raw use raw output format
-t, --types <list> limit the set of filesystems by FS types
-v, --nofsroot don't print [/dir] for bind or btrfs mounts
-R, --submounts print all submounts for the matching filesystems
-S, --source <string> the device to mount (by name, maj:min,
LABEL=, UUID=, PARTUUID=, PARTLABEL=)
-T, --target <string> the mountpoint to use
-h, --help display this help and exit
-V, --version output version information and exit
Available columns:
SOURCE source device
TARGET mountpoint
FSTYPE filesystem type
OPTIONS all mount options
VFS-OPTIONS VFS specific mount options
FS-OPTIONS FS specific mount options
LABEL filesystem label
UUID filesystem UUID
PARTLABEL partition label
PARTUUID partition UUID
MAJ:MIN major:minor device number
ACTION action detected by --poll
OLD-TARGET old mountpoint saved by --poll
OLD-OPTIONS old mount options saved by --poll
SIZE filesystem size
AVAIL filesystem size available
USED filesystem size used
USE% filesystem use percentage
FSROOT filesystem root
TID task ID
ID mount ID
OPT-FIELDS optional mount fields
PROPAGATION VFS propagation flags
FREQ dump(8) frequency in days [fstab only]
PASSNO pass number on parallel fsck(8) [fstab only]
For more details see findmnt(8).
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
I was in the middle of a rebalance on a small test cluster with about 1% of
pgs degraded, and shut the cluster entirely down for maintenance.
On startup, many pgs are entirely unknown, and most are stale. In fact, most
pgs can't even be queried! There are no mon failures. Would the osd logs tell
me why pgs aren't even moving to an inactive state?
I'm not concerned about data loss due to the shutdown (all activity to the
cluster had been stopped), so should I be setting some or all OSDs "
osd_find_best_info_ignore_history_les = true"?
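For clarity, this is what I have in mind, assuming a Mimic-or-newer cluster and a non-containerized deployment (osd.12 is just an example id); please stop me if this is the wrong approach:
```
# per affected OSD: set the override, restart, let the PGs peer
ceph config set osd.12 osd_find_best_info_ignore_history_les true
systemctl restart ceph-osd@12

# then remove the override again once peering has finished
ceph config rm osd.12 osd_find_best_info_ignore_history_les
```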
Thank you,
--
Jeremy Austin
jhaustin(a)gmail.com
On Thu, Feb 4, 2021 at 10:30 PM huxiaoyu(a)horebdata.cn
<huxiaoyu(a)horebdata.cn> wrote:
>
> >IMO with a cluster this size, you should not ever mark out any OSDs --
> >rather, you should leave the PGs degraded, replace the disk (keep the
> >same OSD ID), then recover those objects to the new disk.
> >Or, keep it <40% used (which sounds like a waste).
>
> Dear Dan,
>
> I particularly like your idea of "leave the PGs degraded, and replace the disk with the same OSD ID". This is a wonderful thing I really want to do.
>
> Could you please share some more details on how to achieve this, or some scripts already being tested?
Hi Samuel,
To do this you'd first set a config so that down osds aren't
automatically marked out, either by setting
`mon_osd_down_out_interval` to a really large value, or perhaps
mon_osd_down_out_subtree_limit = osd would work too.
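Concretely, that would be something like the following (the interval value is just an example):
```
# keep down OSDs from being marked out automatically
ceph config set mon mon_osd_down_out_interval 2592000   # e.g. 30 days

# or, alternatively, disable auto-out at the osd subtree level
ceph config set mon mon_osd_down_out_subtree_limit osd
```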
Then when the osd fails, PGs will become degraded.
You just zap, then replace, then recreate the OSD. Here are the
ceph-volume commands for that:
* ceph-volume lvm zap --osd-id 1234 --destroy
* (then you might need `ceph osd destroy 1234`)
* (then replace the drive)
* ceph-volume lvm create <according to your hw> --osd-id 1234
Cheers, Dan
>
> thanks a lot,
>
> Samuel
>
>
>
> ________________________________
> huxiaoyu(a)horebdata.cn
>
>
> From: Dan van der Ster
> Date: 2021-02-04 11:57
> To: Mario Giammarco
> CC: Ceph Users
> Subject: [ceph-users] Re: Worst thing that can happen if I have size= 2
> On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco <mgiammarco(a)gmail.com> wrote:
> >
> >
> >
> > Il giorno mer 3 feb 2021 alle ore 21:22 Dan van der Ster <dan(a)vanderster.com> ha scritto:
> >>
> >>
> >> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if possible.
> >>
> >
> > I will investigate; I heard that erasure coding is slow.
> >
> > Anyway I will write here the reason of this thread:
> > In my customers I have usually proxmox+ceph with:
> >
> > - three servers
> > - three monitors
> > - 6 osd (two per server)
> > - size=3 and min_size=2
> >
> > I followed the recommendations to stay safe.
> > But one day one disk of one server broke; the OSDs were at 55%.
> > What happened then?
> > Ceph started filling the remaining OSD to maintain size=3.
> > The OSD reached 90% and ceph stopped everything.
> > Customer VMs froze and the customer lost time and some data that had not been written to disk.
> >
> > So I got angry.... size=3 and customer still loses time and data?
>
> You should size the osd fullness config in such a way that the failures you
> expect would still leave sufficient capacity.
> In our case, we plan so that we could lose and re-replicate an entire
> rack and still have enough space left. -- (IOW, with 5-6 racks, we
> start to add capacity when the clusters reach ~70-75% full)
>
> In your case, the issue is more extreme:
> Because you have 3 hosts, 2 osds each, and 3 replicas: when one OSD
> fails and is marked out, you are telling ceph that *all* of the
> objects will need to be written to the last remaining disk on that
> host with the failure.
> So unless your cluster was under 40-50% used, that osd is going to
> become overfull. (But BTW, ceph will get backfillfull on the loaded
> OSD before stopping IO -- this should not have blocked your user
> unless they *also* filled the disk with new data at the same time).
>
> IMO with a cluster this size, you should not ever mark out any OSDs --
> rather, you should leave the PGs degraded, replace the disk (keep the
> same OSD ID), then recover those objects to the new disk.
> Or, keep it <40% used (which sounds like a waste).
>
> -- dan
>
> >> Cheers, Dan
> >>
> >> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco <mgiammarco(a)gmail.com> wrote:
> >>>
> >>> Thanks Simon and thanks to other people that have replied.
> >>> Sorry, but let me try to explain myself better.
> >>> It is evident to me that if I have two copies of data, one breaks, and the
> >>> disk with the second copy also breaks while ceph is creating a new copy of
> >>> the data, then you lose the data.
> >>> It is obvious, and a bit paranoid, because many servers at many customers run
> >>> on raid1, and so you are saying: yeah, you have two copies of the data but
> >>> you can break both. Consider that in ceph recovery is automatic, while with
> >>> raid1 someone must manually go to the customer and change disks. So ceph is
> >>> already an improvement in this case even with size=2. With size 3 and min 2
> >>> it is a bigger improvement, I know.
> >>>
> >>> What I ask is this: what happens with min_size=1 and split brain, the network
> >>> going down, or similar things: does ceph block writes because it has no quorum
> >>> of monitors? Are there failure scenarios that I have not considered?
> >>> Thanks again!
> >>> Mario
> >>>
> >>>
> >>>
> >>> Il giorno mer 3 feb 2021 alle ore 17:42 Simon Ironside <
> >>> sironside(a)caffetine.org> ha scritto:
> >>>
> >>> > On 03/02/2021 09:24, Mario Giammarco wrote:
> >>> > > Hello,
> >>> > > Imagine this situation:
> >>> > > - 3 servers with ceph
> >>> > > - a pool with size 2 min 1
> >>> > >
> >>> > > I know perfectly the size 3 and min 2 is better.
> >>> > > I would like to know what is the worst thing that can happen:
> >>> >
> >>> > Hi Mario,
> >>> >
> >>> > This thread is worth a read, it's an oldie but a goodie:
> >>> >
> >>> >
> >>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.ht…
> >>> >
> >>> > Especially this post, which helped me understand the importance of
> >>> > min_size=2
> >>> >
> >>> >
> >>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.ht…
> >>> >
> >>> > Cheers,
> >>> > Simon
> >>> > _______________________________________________
> >>> > ceph-users mailing list -- ceph-users(a)ceph.io
> >>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> >>> >
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users(a)ceph.io
> >>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
Hi,
With 15.2.8, after running "ceph orch rm osd 12 --replace --force",
the PGs on osd.12 are remapped, osd.12 is removed from "ceph osd tree",
the daemon is removed from "ceph orch ps", and the device shows as "available"
in "ceph orch device ls". Everything seems good at this point.
Then I dry-run the service spec.
```
# cat osd-spec.yaml
service_type: osd
service_id: osd-spec
placement:
  hosts:
    - ceph-osd-1
data_devices:
  rotational: 1
db_devices:
  rotational: 0
# ceph orch apply osd -i osd-spec.yaml --dry-run
+---------+----------+------------+----------+----------+-----+
|SERVICE |NAME |HOST |DATA |DB |WAL |
+---------+----------+------------+----------+----------+-----+
|osd |osd-spec |ceph-osd-3 |/dev/sdd |/dev/sdb |- |
+---------+----------+------------+----------+----------+-----+
```
It looks as expected, so I run "ceph orch apply osd -i osd-spec.yaml".
Here is the cephadm log.
```
/bin/docker:stderr --> relative data size: 1.0
/bin/docker:stderr --> passed block_db devices: 1 physical, 0 LVM
/bin/docker:stderr Running command: /usr/bin/ceph-authtool --gen-print-key
/bin/docker:stderr Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
/bin/docker:stderr Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new b05c3c90-b7d5-4f13-8a58-f72761c1971b 12
/bin/docker:stderr Running command: /usr/sbin/vgcreate --force --yes ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64 /dev/sdd
/bin/docker:stderr stdout: Physical volume "/dev/sdd" successfully created.
/bin/docker:stderr stdout: Volume group "ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64" successfully created
/bin/docker:stderr Running command: /usr/sbin/lvcreate --yes -l 572318 -n osd-block-b05c3c90-b7d5-4f13-8a58-f72761c1971b ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64
/bin/docker:stderr stderr: Volume group "ceph-a3886f74-3de9-4e6e-a983-8330eda0bd64" has insufficient free space (572317 extents): 572318 required.
/bin/docker:stderr --> Was unable to complete a new OSD, will rollback changes
```
Q1: Why is the VG name (ceph-<id>) different from the others (ceph-block-<id>)?
Q2: Where does that 572318 come from? All HDDs are the same model, and the VG
"Total PE" on each of them is 572317.
Has anyone seen similar issues? Anything I am missing?
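For reference, this is how I am reading the extent numbers on those HDDs, so the requested 572318 is exactly one extent more than the VG holds:
```
# per-VG extent accounting; vg_extent_count on these HDDs shows 572317
vgs -o vg_name,vg_extent_count,vg_free_count,vg_extent_size
```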
Thanks!
Tony
Hi,
I found 600-700 stale instances with the "reshard stale-instances list" command.
Is there a way to clean them up (and should I actually clean them up)?
The "stale-instances rm" command doesn't work in multisite.
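For clarity, these are the commands I mean; the second one is the cleanup step that refuses to run because the zone is part of a multisite configuration:
```
# listing the stale bucket index instances works fine
radosgw-admin reshard stale-instances list

# this is the cleanup that does not work in multisite
radosgw-admin reshard stale-instances rm
```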
Thank you