Hi,
my current CRUSH map includes multiple roots representing different disk
types. There are multiple CRUSH rules, one for each pool, and each pool
corresponds to a disk type: hdd, ssd or nvme.
Question:
What is the recommended procedure to modify the CRUSH map in order to
define only one root and "transfer" all other roots to additional disk
types?
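For reference, the single-root layout I have in mind would use device
classes, something like this (a rough sketch with the standard Ceph CLI;
rule and pool names are just examples):

# tag each OSD with its class (rm-device-class first if one is already set)
ceph osd crush set-device-class hdd osd.0 osd.1
ceph osd crush set-device-class ssd osd.2
# one rule per device class, all under the single root "default"
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
# point each pool at the matching rule
ceph osd pool set mypool_ssd crush_rule replicated_ssd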
THX
Hi everyone,
I have a problem trying to add an iSCSI gateway. The following error is
generated when adding the new gateway:
iscsi-target...-igw/gateways> create ceph-iscsi3 192.168.201.3
Adding gateway, sync'ing 3 disk(s) and 2 client(s)
Failed : /etc/ceph/iscsi-gateway.cfg on ceph-iscsi3 does not match the
local version. Correct and retry request
But the file iscsi-gateway.cfg is exactly the same on all gateways.
SELinux is disabled, and permissions are OK too.
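For what it's worth, this is how I compared the files (hostnames are
mine), and the checksums are identical on all three gateways:

for h in ceph-iscsi1 ceph-iscsi2 ceph-iscsi3; do
    ssh "$h" md5sum /etc/ceph/iscsi-gateway.cfg
done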
I am using Ceph 13.2.6. Can anyone help me?
Regards
Gesiel
Hi Ceph users!
After years of using Ceph, we plan to soon build a new cluster, bigger than
anything we've built in the past. As the project is still at the design
stage, I'd like to have your thoughts on the plan: any feedback is welcome :)
## Requirements
* ~1 PB usable space for file storage, extensible in the future
* The files are mostly "hot" data, no cold storage
* Purpose: storage for big files, used mostly from Windows workstations (10G access)
* The more performance, the better :)
## Global design
* 8+3 Erasure Coded pool
* ZFS on RBD, exposed via samba shares (cluster with failover)
## Hardware
* 1 rack (multi-site would be better, of course...)
* OSD nodes: 14 x Supermicro servers, each with:
  * 24 usable bays in 2U of rack space
  * 16 x 10 TB nearline SAS HDDs (8 bays kept free for future needs)
  * 2 x Xeon Silver 4212 (12C/24T)
  * 128 GB RAM
  * 4 x 40G QSFP+
* Networking: 2 x Cisco N3K 3132Q or 3164Q
  * 2 x 40G per server for the ceph (cluster) network (LACP/VPC for HA)
  * 2 x 40G per server for the public network (LACP/VPC for HA)
  * QSFP+ DAC cables
## Sizing
If we've done the maths right, we expect to have (quick arithmetic below):
* 2.24 PB of raw storage, extensible to 3.36 PB by adding HDDs
* 1.63 PB of expected usable space with 8+3 EC, extensible to 2.44 PB
* ~1 PB of usable space if we want to keep OSD usage under 66% to allow
losing nodes without problems, extensible to 1.6 PB (same condition)
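For reference, the arithmetic behind those numbers (8/11 is the usable
fraction of an 8+3 EC pool):

raw now    : 14 nodes x 16 HDDs x 10 TB = 2240 TB ~= 2.24 PB
raw full   : 14 nodes x 24 HDDs x 10 TB = 3360 TB ~= 3.36 PB
EC usable  : raw x 8/11 -> 1.63 PB now, 2.44 PB when full
66% target : EC usable x 0.66 -> ~1.08 PB now, ~1.61 PB when full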
## Reflections
* We're used to running mon and mgr daemons on a few of our OSD nodes,
without any issue so far: is this a bad idea for a big cluster?
* We thought about using cache tiering on an SSD pool, but a large part of
the PB is used on a daily basis, so we expect the cache to be not very
effective and really expensive?
* Could a 2x10G network be enough ?
* ZFS on Ceph ? Any thoughts ?
* What about CephFS? We'd like to use RBD diff for backups, but it seems
impossible to use snapshot diffs with CephFS? (rough RBD flow below)
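(On the backup point, the incremental RBD flow we have in mind is roughly
the following; pool, image and snapshot names are placeholders.)

rbd snap create rbd/zvol@2019-12-03
rbd export-diff --from-snap 2019-12-02 rbd/zvol@2019-12-03 - \
    | ssh backuphost rbd import-diff - backup/zvol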
Thanks for reading and sharing your experiences!
F.
Hi Robert,
I am not quite sure that I understand your question correctly, but what I
gather is that you want the inbound writes to land on the cache tier, which
presumably would be on faster media, possibly SSDs. From there you would
want the data to trickle down to the base tier, which is an EC pool hosted
on HDDs.
Some pointers I have:
It is better to have separate media for the base and cache tiers, HDD and
SSD respectively.
If the intent is never to promote to the cache tier on read, you could set
min_read_recency_for_promote to a high number such as 3, and at the same
time make the bloom filter window small. (This basically translates into:
promote only if the object has been read X times in the past Y seconds.)
Keep in mind that the larger the window, the larger the bloom filter, and
hence you will see an increase in OSD memory usage.
I have a patch lurking somewhere which disables promotes entirely; let me
check on it if this is for a specific use case.
If your intent is to have a constant decay rate from the cache tier to the
base tier, here is what you could do (see the sketch after this list):
1. Set the max objects threshold on the cache tier to X.
2. Set the max size threshold to Y, normally 60-70 percent of the total
cache tier capacity.
3. Flushing starts as soon as the first of these thresholds is hit.
4. You could set the evict age to roughly double the time you expect the
data to take to reach the base tier.
5. Lastly, have you tried running COSBench or a related tool to qualify the
IOPS of your base tier with EC enabled? You may not require the cache tier
at all.
6. There is substantial overhead in maintaining a cache tier, the major
issue being the absence of throttles on how flushing happens.
7. A thundering herd of write requests can cause a huge amount of flushing
to the base tier.
8. IMHO it is suitable and predictable for workloads where the number of
ingress requests can be predicted and there is some kind of rate limiting
on them.
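Putting the above together, a rough sketch of the kind of configuration I
mean (pool names and numbers are placeholders, not recommendations):

ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool
# one small bloom hit set; recency of 3 means reads never qualify
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_count 1
ceph osd pool set cachepool hit_set_period 600
ceph osd pool set cachepool min_read_recency_for_promote 3
# flush thresholds: X objects, or Y bytes (~60-70% of capacity)
ceph osd pool set cachepool target_max_objects 1000000
ceph osd pool set cachepool target_max_bytes 500000000000
# age-based flush/evict, evict age roughly double the flush age
ceph osd pool set cachepool cache_min_flush_age 600
ceph osd pool set cachepool cache_min_evict_age 1200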
Hope this helps
Thanks
Romit
On Tue, 3 Dec 2019, 04:11, <ceph-users-request(a)ceph.io> wrote:
> Send ceph-users mailing list submissions to
> ceph-users(a)ceph.io
>
> To subscribe or unsubscribe via email, send a message with subject or
> body 'help' to
> ceph-users-request(a)ceph.io
>
> You can reach the person managing the list at
> ceph-users-owner(a)ceph.io
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of ceph-users digest..."
>
> Today's Topics:
>
> 1. Re: ceph node crashed with these errors "kernel: ceph:
> build_snap_context" (maybe now it is urgent?)
> (Ilya Dryomov)
> 2. Re: ceph node crashed with these errors "kernel: ceph:
> build_snap_context" (maybe now it is urgent?)
> (Marc Roos)
> 3. Re: ceph node crashed with these errors "kernel: ceph:
> build_snap_context" (maybe now it is urgent?)
> (Marc Roos)
> 4. Re: Possible data corruption with 14.2.3 and 14.2.4
> (Simon Ironside)
> 5. Re: ceph node crashed with these errors "kernel: ceph:
> build_snap_context" (maybe now it is urgent?)
> (Marc Roos)
> 6. Can min_read_recency_for_promote be -1 (Robert LeBlanc)
>
>
> ----------------------------------------------------------------------
>
> Date: Mon, 2 Dec 2019 14:59:05 +0100
> From: Ilya Dryomov <idryomov(a)gmail.com>
> Subject: [ceph-users] Re: ceph node crashed with these errors "kernel:
> ceph: build_snap_context" (maybe now it is urgent?)
> To: Marc Roos <M.Roos(a)f1-outsourcing.eu>
> Cc: ceph-users <ceph-users(a)ceph.io>, jlayton <jlayton(a)kernel.org>
> Message-ID:
> <
> CAOi1vP-uyxeaKvuxUQbe2nsuXH9-f6_QxcggOCv6LrCBzugJOw(a)mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Mon, Dec 2, 2019 at 1:23 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
> >
> >
> >
> > I guess this is related? kworker 100%
> >
> >
> > [Mon Dec 2 13:05:27 2019] SysRq : Show backtrace of all active CPUs
> > [Mon Dec 2 13:05:27 2019] sending NMI to all CPUs:
> > [Mon Dec 2 13:05:27 2019] NMI backtrace for cpu 0 skipped: idling at pc
> > 0xffffffffb0581e94
> > [Mon Dec 2 13:05:27 2019] NMI backtrace for cpu 1 skipped: idling at pc
> > 0xffffffffb0581e94
> > [Mon Dec 2 13:05:27 2019] NMI backtrace for cpu 2 skipped: idling at pc
> > 0xffffffffb0581e94
> > [Mon Dec 2 13:05:27 2019] NMI backtrace for cpu 3 skipped: idling at pc
> > 0xffffffffb0581e94
> > [Mon Dec 2 13:05:27 2019] NMI backtrace for cpu 4
> > [Mon Dec 2 13:05:27 2019] CPU: 4 PID: 426200 Comm: kworker/4:2 Not
> > tainted 3.10.0-1062.4.3.el7.x86_64 #1
> > [Mon Dec 2 13:05:27 2019] Hardware name: Supermicro
> > X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
> > [Mon Dec 2 13:05:27 2019] Workqueue: ceph-msgr ceph_con_workfn
> > [libceph]
> > [Mon Dec 2 13:05:27 2019] task: ffffa0c8e1240000 ti: ffffa0ccb6364000
> > task.ti: ffffa0ccb6364000
> > [Mon Dec 2 13:05:27 2019] RIP: 0010:[<ffffffffc08d7db9>]
> > [<ffffffffc08d7db9>] cmpu64_rev+0x19/0x20 [ceph]
> > [Mon Dec 2 13:05:27 2019] RSP: 0018:ffffa0ccb6367a20 EFLAGS: 00000202
> > [Mon Dec 2 13:05:27 2019] RAX: 0000000000000001 RBX: 0000000000000038
> > RCX: 0000000000000008
> > [Mon Dec 2 13:05:27 2019] RDX: 0000000000025c33 RSI: ffffa0cbbe380050
> > RDI: ffffa0cbbe380030
> > [Mon Dec 2 13:05:27 2019] RBP: ffffa0ccb6367a20 R08: 0000000000000018
> > R09: 00000000000013ed
> > [Mon Dec 2 13:05:27 2019] R10: 0000000000000002 R11: ffffe94994f8e000
> > R12: ffffa0cbbe380030
> > [Mon Dec 2 13:05:27 2019] R13: ffffffffc08d7da0 R14: ffffa0cbbe380018
> > R15: ffffa0cbbe380050
> > [Mon Dec 2 13:05:27 2019] FS: 0000000000000000(0000)
> > GS:ffffa0d2cfb00000(0000) knlGS:0000000000000000
> > [Mon Dec 2 13:05:27 2019] CS: 0010 DS: 0000 ES: 0000 CR0:
> > 0000000080050033
> > [Mon Dec 2 13:05:27 2019] CR2: 000055a7c413fcb9 CR3: 0000001813010000
> > CR4: 00000000000607e0
> > [Mon Dec 2 13:05:27 2019] Call Trace:
> > [Mon Dec 2 13:05:27 2019] [<ffffffffb019303f>] sort+0x1af/0x260
> > [Mon Dec 2 13:05:27 2019] [<ffffffffb0192e60>] ? u32_swap+0x10/0x10
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc08d807b>]
> > build_snap_context+0x12b/0x290 [ceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc08d820c>]
> > rebuild_snap_realms+0x2c/0x90 [ceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc08d822b>]
> > rebuild_snap_realms+0x4b/0x90 [ceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc08d91fc>]
> > ceph_update_snap_trace+0x3ec/0x530 [ceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc08e2239>]
> > handle_reply+0x359/0xc60 [ceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc08e48ba>] dispatch+0x11a/0xb00
> > [ceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffb042e56a>] ?
> > kernel_recvmsg+0x3a/0x50
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc05fcff4>] try_read+0x544/0x1300
> > [libceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafee13ce>] ?
> > account_entity_dequeue+0xae/0xd0
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafee4d5c>] ?
> > dequeue_entity+0x11c/0x5e0
> > [Mon Dec 2 13:05:27 2019] [<ffffffffb042e417>] ?
> > kernel_sendmsg+0x37/0x50
> > [Mon Dec 2 13:05:27 2019] [<ffffffffc05fdfb4>]
> > ceph_con_workfn+0xe4/0x1530 [libceph]
> > [Mon Dec 2 13:05:27 2019] [<ffffffffb057f568>] ?
> > __schedule+0x448/0x9c0
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafebe21f>]
> > process_one_work+0x17f/0x440
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafebf336>]
> > worker_thread+0x126/0x3c0
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafebf210>] ?
> > manage_workers.isra.26+0x2a0/0x2a0
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafec61f1>] kthread+0xd1/0xe0
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafec6120>] ?
> > insert_kthread_work+0x40/0x40
> > [Mon Dec 2 13:05:27 2019] [<ffffffffb058cd37>]
> > ret_from_fork_nospec_begin+0x21/0x21
> > [Mon Dec 2 13:05:27 2019] [<ffffffffafec6120>] ?
> > insert_kthread_work+0x40/0x40
> > [Mon Dec 2 13:05:27 2019] Code: 87 c8 fc ff ff 5d 0f 94 c0 0f b6 c0 c3
> > 0f 1f 44 00 00 66 66 66 66 90 48 8b 16 48 39 17 b8 01 00 00 00 55 48 89
> > e5 72 08 0f 97 c0 <0f> b6 c0 f7 d8 5d c3 66 66 66 66 90 55 f6 05 ed 92
> > 02 00 04 48
> > [Mon Dec 2 13:05:27 2019] NMI backtrace for cpu 5
>
> Yes, seems related. I'm not sure how it relates to an upgrade to
> nautilus, but as I mentioned in a different message, with thousands of
> snapshots you are in dangerous territory anyway.
>
> Thanks,
>
> Ilya
>
> ------------------------------
>
> Date: Mon, 2 Dec 2019 15:06:54 +0100
> From: "Marc Roos" <M.Roos(a)f1-outsourcing.eu>
> Subject: [ceph-users] Re: ceph node crashed with these errors "kernel:
> ceph: build_snap_context" (maybe now it is urgent?)
> To: idryomov <idryomov(a)gmail.com>
> Cc: ceph-users <ceph-users(a)ceph.io>, jlayton <jlayton(a)kernel.org>
> Message-ID: <"H000007100158998.1575295614.sx.f1-outsourcing.eu*"@MHS>
> Content-Type: text/plain; charset="UTF-8"
>
> >
> >> >
> >> >ISTR there were some anti-spam measures put in place. Is your account
> >> >waiting for manual approval? If so, David should be able to help.
> >>
> >> Yes, if I remember correctly, I get "waiting approval" when I try to
> >> log in.
> >>
> >> >>
> >> >>
> >> >>
> >> >> Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9287
> >> >> ffff911a9a26bd00 fail -12
> >> >> Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9283
> >> >
> >> >
> >> >It is failing to allocate memory. "low load" isn't very specific,
> >> >can you describe the setup and the workload in more detail?
> >>
> >> 4 nodes (osd, mon combined); the 4th node has a local cephfs mount,
> >> which is rsync'ing some files from vm's. 'low load': I have a sort of
> >> test setup, going to production. Mostly the nodes are below a load of 1
> >> (except when the concurrent rsync starts).
> >>
> >> >How many snapshots do you have?
> >>
> >> Don't know how to count them. I have a script running on 2000 dirs. If
> >> one of these dirs is not empty it creates a snapshot. So in theory I
> >> could have 2000 x 7 days = 14000 snapshots.
> >> (btw the cephfs snapshots are in a different tree than the rsync is
> >> using)
> >
> >Is there a reason you are snapshotting each directory individually
> >instead of just snapshotting a common parent?
>
> Yes because I am not sure the snapshot frequency on all folders is going
> to be the same.
>
> >If you have thousands of snapshots, you may eventually hit a different
> >bug:
> >
> >https://tracker.ceph.com/issues/21420
> >https://docs.ceph.com/docs/master/cephfs/experimental-features/#snapshots
> >
> >Be aware that each set of 512 snapshots amplifies your writes by 4K in
> >terms of network consumption. With 14000 snapshots, a 4K write would
> >need to transfer ~109K worth of snapshot metadata to carry itself out.
> >
>
> Also when I am not even writing to a tree with snapshots enabled? I am
> rsyncing to dir3
>
> .
> ├── dir1
> │ ├── dira
> │ │ └── .snap
> │ ├── dirb
> │ ├── dirc
> │ │ └── .snap
> │ └── dird
> │ └── .snap
> ├── dir2
> └── dir3
>
> ------------------------------
>
> Date: Mon, 2 Dec 2019 16:29:07 +0100
> From: "Marc Roos" <M.Roos(a)f1-outsourcing.eu>
> Subject: [ceph-users] Re: ceph node crashed with these errors "kernel:
> ceph: build_snap_context" (maybe now it is urgent?)
> To: idryomov <idryomov(a)gmail.com>
> Cc: ceph-users <ceph-users(a)ceph.io>, jlayton <jlayton(a)kernel.org>
> Message-ID: <"H000007100158aca.1575300547.sx.f1-outsourcing.eu*"@MHS>
> Content-Type: text/plain; charset="UTF-8"
>
>
> I can confirm that removing all the snapshots seems to resolve the
> problem.
>
> A - I would propose a redesign such that only snapshots below the
> mountpoint are taken into account, not snapshots in the entire
> filesystem. That should fix a lot of issues.
>
> B - That reminds me of the mv command, which does not move data across
> different pools in the fs. I would like to see this, because it is the
> logical thing to expect.
>
> ------------------------------
>
> Date: Mon, 2 Dec 2019 15:54:54 +0000
> From: Simon Ironside <sironside(a)caffetine.org>
> Subject: [ceph-users] Re: Possible data corruption with 14.2.3 and
> 14.2.4
> To: ceph-users(a)ceph.io
> Message-ID: <21d057e9-0088-4847-6d40-19cf2c848395(a)caffetine.org>
> Content-Type: text/plain; charset=utf-8; format=flowed
>
> Any word on 14.2.5? Nervously waiting here . . .
>
> Thanks,
> Simon.
>
> On 18/11/2019 11:29, Simon Ironside wrote:
>
> > I will sit tight and wait for 14.2.5.
> >
> > Thanks again,
> > Simon.
>
> ------------------------------
>
> Date: Mon, 2 Dec 2019 19:32:03 +0100
> From: "Marc Roos" <M.Roos(a)f1-outsourcing.eu>
> Subject: [ceph-users] Re: ceph node crashed with these errors "kernel:
> ceph: build_snap_context" (maybe now it is urgent?)
> To: ceph-users <ceph-users(a)ceph.io>, lhenriques <lhenriques(a)suse.com>
> Message-ID: <"H000007100158b41.1575311519.sx.f1-outsourcing.eu*"@MHS>
> Content-Type: text/plain; charset="ISO-8859-1"
>
>
> Yes Luis, good guess!! ;)
>
>
>
> -----Original Message-----
> Cc: ceph-users
> Subject: Re: [ceph-users] ceph node crashed with these errors "kernel:
> ceph: build_snap_context" (maybe now it is urgent?)
>
> On Mon, Dec 02, 2019 at 10:27:21AM +0100, Marc Roos wrote:
> >
> > I have been asking before[1]. Since Nautilus upgrade I am having
> > these, with a total node failure as a result(?). Was not expecting
> > this in my 'low load' setup. Maybe now someone can help resolving
> > this? I am also waiting quite some time to get access at
> > https://tracker.ceph.com/issues.
>
> Just a wild guess: do you have a lot of snapshots (> ~400)? If so,
> that's probably the problem. See [1] and [2].
>
> [1]
> https://docs.ceph.com/docs/master/cephfs/experimental-features/#snapshots
> [2] https://tracker.ceph.com/issues/21420
>
> Cheers,
> --
> Luís
>
> >
> >
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9287 ffff911a9a26bd00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9283 ffff911d34e69d00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9276 ffff911d34e69c00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c926c ffff912068b92c00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9268 ffff912068b93000 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c926d ffff912068b92900 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c928a ffff912118e5be00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9272 ffff9119950d9500 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9269 ffff911940f3d000 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9270 ffff911748427c00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c926b ffff91169b000600 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9281 ffff91169b000500 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9288 ffff9115844d2500 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c927d ffff9115844d2e00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9280 ffff91186401b000 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9267 ffff9121535ecc00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c927c ffff9121cecb1e00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9271 ffff9121cecb0400 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9279 ffff911d26646300 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c927f ffff911d26646900 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9275 ffff9121cecb1700 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9259 ffff91170c9f6600 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9257 ffff9118ef2a8000 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c924e ffff911a1e091800 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9262 ffff911a1e090c00 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9266 ffff9115e3859500 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c924f ffff9118aefd1300 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c925f ffff91170c9f6100 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9252 ffff9115e3859800 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9256 ffff912045dc5300 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9254 ffff91170c9f6900 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020c9261 ffff91170c9f7100 fail -12
> > Dec 1 03:14:36 c04 kernel: ceph: build_snap_context 100020d4ec4 ffff9118aefd0000 fail -12
> >
> > [1]
> > https://www.mail-archive.com/ceph-users@ceph.io/msg01088.html
> > https://www.mail-archive.com/ceph-users@ceph.io/msg00969.html
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
> > email to ceph-users-leave(a)ceph.io
>
>
> ------------------------------
>
> Date: Mon, 2 Dec 2019 14:39:26 -0800
> From: Robert LeBlanc <robert(a)leblancnet.us>
> Subject: [ceph-users] Can min_read_recency_for_promote be -1
> To: ceph-users <ceph-users(a)ceph.io>
> Message-ID:
> <
> CAANLjFoecdW7oBh78L3dNO83C-DpDmqXw-kKtT+ShNKXjsqKJg(a)mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
>
> I'd like to configure a cache tier to act as a write buffer, so that if
> writes come in, it promotes objects, but reads never promote an object. We
> have a lot of cold data so we would like to tier down to an EC pool
> (CephFS) after a period of about 30 days to save space. The storage tier
> and the 'cache' tier would be on the same spindles, so the only performance
> improvement would be from the faster writes with replication. So we don't
> want to really move data between tiers.
>
> The idea would be to not promote on read since EC read performance is good
> enough and have writes go to the cache tier where the data may be 'hot' for
> a week or so, then get cold.
>
> It seems that we would only need one hit_set and if -1 can't be set for
> min_read_recency_for_promote, I could probably use 2 which would never hit
> because there is only one set, but that may error too. The follow up is how
> big a set should be as it only really tells if an object "may" be in cache
> and does not determine when things are flushed, so it really only matters
> how out of date we are okay with the bloom filter being, right?
> So we could have it be a day long if we are okay with that stale rate? Is
> there any advantage to having a longer period for a bloom filter? Now, I'm
> starting to wonder if I even need a bloom filter for this use case, can I
> get tiering to work without it and only use
> cache_min_flush_age/cache_min_evict_age, since I don't care about promoting
> when there are X hits in Y time?
>
> Thanks
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 83, Issue 5
> *****************************************
>
Hi all,
We have a Ceph (version 12.2.4) cluster that uses EC pools, and it
consists of 10 hosts for OSDs.
The corresponding commands to create the EC pool are listed as follows:
ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
plugin=jerasure \
k=4 \
m=3 \
technique=reed_sol_van \
packetsize=2048 \
crush-device-class=hdd \
crush-failure-domain=host
ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure \
profile_jerasure_4_3_reed_sol_van
Since the EC pool's crush-failure-domain is configured to be "host", we
simply disabled the network interfaces of some hosts (using the "ifdown"
command) to verify the fault tolerance of the EC pool.
And here are the phenomena we have observed:
First of all, the IO rate (of "rados bench", which we used for
benchmarking) drops immediately to 0 when one host goes offline.
Secondly, it takes a long time (around 100 seconds) for Ceph to detect
that the corresponding OSDs on that host are down.
Finally, once Ceph has detected all offline OSDs, the EC pool seems to
behave normally and is ready for IO operations again.
So, here are my questions:
1. Is it normal that the IO rate drops to 0 immediately even though only
one host goes offline?
2. How can we make Ceph reduce the time needed to detect failed OSDs?
(See the settings below.)
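For question 2, these are the options I believe control failure detection;
the values shown are the defaults as far as I know, so please correct me
if I am wrong:

[osd]
osd heartbeat interval = 6       # seconds between peer heartbeats
osd heartbeat grace = 20         # silence before an OSD is reported down

[mon]
mon osd min down reporters = 2   # reporters needed to mark an OSD down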
Thanks for any help.
Best regards,
Majia Xiao
I have created a Ceph cluster.
node-1: mon, mgr, osd.0, mds
node-2: mon, mgr, osd.1, mds
node-3: mon, mgr, osd.2, mds
When the cluster is working normally, mounting with the command "mount -t
ceph <node-*-ip>:6789:/ /mnt -o name=admin,secret=<admin client secret>"
works fine.
But when a node goes down unexpectedly (e.g. power-off), the same mount
command hangs for a long time (maybe more than 1 minute).
I tried configuring "mds reconnect timeout = 0" in ceph.conf, and the
mount time was shortened.
My question:
Which configuration options affect the CephFS mount in this scenario?
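For reference, the variant I am testing now lists all three monitors and
sets the kernel client's mount_timeout option (10 is just an example; the
default is 60 seconds as far as I know):

mount -t ceph node-1-ip:6789,node-2-ip:6789,node-3-ip:6789:/ /mnt \
    -o name=admin,secret=<admin client secret>,mount_timeout=10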
BRs.
hfx(a)portsip.cn
Hi everyone,
We've identified a data corruption bug[1], first introduced[2] (by yours
truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The corruption
appears as an OSD assertion that looks like
os/bluestore/fastbmap_allocator_impl.h: 750: FAILED ceph_assert(available >= allocated)
or in some cases a rocksdb checksum error. It only affects BlueStore OSDs
that have a separate 'db' or 'wal' device.
We have a fix[3] that is working its way through testing, and will
expedite the next Nautilus point release (14.2.5) once it is ready.
If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs with
separate 'db' volumes, you should consider waiting to upgrade
until 14.2.5 is released.
A big thank you to Igor Fedotov and several *extremely* helpful users who
managed to reproduce and track down this problem!
sage
[1] https://tracker.ceph.com/issues/42223
[2] https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad…
[3] https://github.com/ceph/ceph/pull/31621
I'd like to configure a cache tier to act as a write buffer, so that if
writes come in, it promotes objects, but reads never promote an object. We
have a lot of cold data so we would like to tier down to an EC pool
(CephFS) after a period of about 30 days to save space. The storage tier
and the 'cache' tier would be on the same spindles, so the only performance
improvement would be from the faster writes with replication. So we don't
want to really move data between tiers.
The idea would be to not promote on read since EC read performance is good
enough and have writes go to the cache tier where the data may be 'hot' for
a week or so, then get cold.
It seems that we would only need one hit_set and if -1 can't be set for
min_read_recency_for_promote, I could probably use 2 which would never hit
because there is only one set, but that may error too. The follow up is how
big a set should be as it only really tells if an object "may" be in cache
and does not determine when things are flushed, so it really only matters
how out of date we are okay with the bloom filter being, right?
So we could have it be a day long if we are okay with that stale rate? Is
there any advantage to having a longer period for a bloom filter? Now, I'm
starting to wonder if I even need a bloom filter for this use case, can I
get tiering to work without it and only use
cache_min_flush_age/cache_min_evict_age, since I don't care about promoting
when there are X hits in Y time?
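For what it's worth, the age-only setup I am imagining would be something
like this (pool name is a placeholder, and I don't know yet whether the
hit_set can really be dropped):

ceph osd pool set cache hit_set_type bloom      # possibly still required
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 86400    # a day of staleness is fine
ceph osd pool set cache cache_min_flush_age 604800   # ~1 week 'hot'
ceph osd pool set cache cache_min_evict_age 2592000  # ~30 days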
Thanks
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1