Hi Robert,
> Another option is if both RDMA ports are on the same card, then you can do RDMA with a bond. This does not work if you have two separate cards.
Yes, we recently talked to Mellanox and their engineers also recommend this approach.
> As far as your questions go, my guess would be that you would want to have the different NICs in different broadcast domains
Yes, the idea was to use two public networks on two different NICs, with addresses from different subnets. It is possible to set two or more networks in the Ceph configuration, but it's unclear how Ceph is going to use this configuration.
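For reference, declaring several public networks is just a comma-separated list in ceph.conf; a minimal sketch with made-up subnets:

  [global]
  public network = 192.168.1.0/24, 192.168.2.0/24

but the documentation doesn't say how the daemons choose between them.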
> or set up Source Based Routing and bind the source port on the connection (not the easiest, but allows you to have multiple NICs in the same broadcast domain). I don't have experience with Ceph in this type of configuration.
It's too complicated and, frankly, when you're trying to reach maximum performance, source-based routing is a bit out of scope :-)
In the end, we're going to test bonding two ports on the same NIC.
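The bond itself should be straightforward; a sketch for ifupdown (interface names and addresses are, of course, placeholders):

  auto bond0
  iface bond0 inet static
      address 192.168.1.11
      netmask 255.255.255.0
      bond-slaves enp94s0f0 enp94s0f1
      bond-mode 802.3ad
      bond-xmit-hash-policy layer3+4
      bond-miimon 100

Whether this keeps the RDMA offload working is exactly what we want to verify.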
Thank you.
--
Volodymyr Litovka
"Vision without Execution is Hallucination." -- Thomas Edison
> On Fri, Aug 2, 2019 at 9:41 AM Volodymyr Litovka <doka.ua(a)gmx.com> wrote:
> Dear colleagues,
>
> at the moment we use Ceph in a routed environment (OSPF, ECMP) and everything is OK: reliability is high and there is nothing to complain about. But for hardware reasons (to be more precise, RDMA offload), we are faced with the need to operate Ceph directly on physical interfaces.
>
> According to documentation, "We generally recommend that dual-NIC systems either be configured with two IPs on the same network, or bonded."
>
> Q1: Did anybody test the first scenario (two IPs on the same network) and can explain how Ceph will behave? I think this configuration requires just one statement in 'public network' (where both interfaces reside)? How will it distribute traffic between links, how will it detect link failures, and how will it switch over?
>
> Q2: Did anybody test a slightly different scenario - both NICs have addresses in different networks and the Ceph configuration contains two 'public networks'? The questions are the same - how does Ceph distribute traffic between links and how does it recover from link failures?
>
> Thank you.
>
> --
> Volodymyr Litovka
> "Vision without Execution is Hallucination." -- Thomas Edison
Hi All,
The RGW log keeps filling the disk to capacity. Is there a way to disable the RGW log, change the log level, or change the default log path?
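What I found so far are these ceph.conf options; a sketch, assuming the section name of our gateway (not sure these are the right or only knobs):

  [client.rgw.gateway1]
  log file = /var/log/ceph/rgw.log   # or another path with more space
  debug rgw = 0/0                    # silence RGW debug output
  rgw enable ops log = false         # don't keep the ops log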
Thanks,
Yang
Hello again,
Getting back to this:
On Sun, 4 Aug 2019 10:47:27 +0900 Christian Balzer wrote:
> Hello,
>
> Preparing the first production BlueStore, Nautilus (latest) based cluster,
> I've run into the same things other people and I ran into before.
>
> First, the HW: 3 nodes with 12 SATA HDDs each, IT-mode LSI 3008, WAL/DB on
> 40GB SSD partitions. (Boy, do I hate the inability of ceph-volume to deal
> with raw partitions.)
> The SSDs aren't a bottleneck in any scenario.
> A single E5-1650 v3 @ 3.50GHz; the CPU isn't a bottleneck in any scenario,
> at less than 15% of a core per OSD.
>
> Connection is via 40Gb/s InfiniBand (IPoIB); no issues here, as the numbers
> later will show.
>
> Clients are KVMs on EPYC-based compute nodes; maybe some more speed could
> be squeezed out here with different VM configs, but the CPU isn't an issue
> in the problem cases.
>
>
>
> 1. 4k random I/O can cause degraded PGs
> I've run into the same/similar issue as Nathan Fish here:
> https://www.spinics.net/lists/ceph-users/msg526
> During the first 2 tests with 4k random I/O I got briefly degraded PGs as
> well, with no indication in CPU or SSD utilization accounting for this.
> HDDs were of course busy at that time.
> Wasn't able to reproduce this so far, but it leaves me less than
> confident.
>
>
This happened again yesterday when rsyncing 260GB of files averaging 4MB
into a Ceph-image-backed VM.
Given the nature of this rsync, nothing on the Ceph nodes was the least bit
busy: the HDDs were all below 15% utilization, CPU bored, etc.
Still we got:
---
2019-08-07 15:38:23.452580 osd.21 (osd.21) 651 : cluster [DBG] 1.125 starting backfill to osd.9 from (0'0,0'0] MAX to 1297'21584
2019-08-07 15:38:24.454942 mon.ceph-05 (mon.0) 182756 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs peering (PG_AVAILABILITY)
2019-08-07 15:38:25.396756 mon.ceph-05 (mon.0) 182757 : cluster [DBG] osdmap e1302: 36 total, 36 up, 36 in
2019-08-07 15:38:23.452026 osd.12 (osd.12) 767 : cluster [DBG] 1.105 starting backfill to osd.25 from (0'0,0'0] MAX to 1297'6782
---
Unfortunately all I have in the OSD log is this:
---
2019-08-07 15:38:23.461 7f155e71b700 1 osd.9 pg_epoch: 1299 pg[1.125( empty local-lis/les=0/0 n=0 ec=189/189 lis/c 1286/1286 les/c/f 1287/1287/0 1298/1299/189) [21,9,28]/[21,28,3] r=-1 lpr=1299 pi=[1286,1299)/1 crt=0'0 unknown mbc={}] state<Start>: transitioning to Stray
2019-08-07 15:38:24.353 7f155e71b700 1 osd.9 pg_epoch: 1301 pg[1.125( v 1297'21584 (1246'18584,1297'21584] local-lis/les=1299/1300 n=5 ec=189/189 lis/c 1299/1299 les/c/f 1300/1300/0 1298/1301/189) [21,9,28] r=1 lpr=1301 pi=[1299,1301)/1 luod=0'0 crt=1297'21584 active mbc={}] start_peering_interval up [21,9,28] -> [21,9,28], acting [21,28,3] -> [21,9,28], acting_primary 21 -> 21, up_primary 21 -> 21, role -1 -> 1, features acting 4611087854031667199 upacting 4611087854031667199
2019-08-07 15:38:24.353 7f155e71b700 1 osd.9 pg_epoch: 1301 pg[1.125( v 1297'21584 (1246'18584,1297'21584] local-lis/les=1299/1300 n=5 ec=189/189 lis/c 1299/1299 les/c/f 1300/1300/0 1298/1301/189) [21,9,28] r=1 lpr=1301 pi=[1299,1301)/1 crt=1297'21584 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
---
How can I find out what happened here? Given that it might not happen
again anytime soon, cranking up debug levels now is a tad late.
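For next time, I guess the relevant levels could be raised on the fly once
it recurs; a sketch (exact subsystems and levels are a guess on my part):

  ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'

But that of course doesn't help with this occurrence.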
Thanks,
Christian
--
Christian Balzer Network/Systems Engineer
chibi(a)gol.com Rakuten Mobile Inc.
Hi list,
I'm trying to figure out whether we want rbd_store_chunk_size = 4 or
rbd_store_chunk_size = 8 (or maybe something different) in our new
OpenStack / Ceph environment.
Any opinions on this matter?
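For context, this is the knob in glance-api.conf I mean; the value is in
megabytes, and 8 is, as far as I can tell, the default:

  [glance_store]
  stores = rbd
  default_store = rbd
  rbd_store_chunk_size = 8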
Cheers,
Kees
--
https://nefos.nl/contact
Nefos IT bv
Ambachtsweg 25 (industrienummer 4217)
5627 BZ Eindhoven
Nederland
KvK 66494931
/Present on Monday, Tuesday, Wednesday and Friday/
Hello,
I work at Nokia as a software QA engineer and I am trying to install Ceph on CentOS 7.4.
But I am getting an error with the following output:
[mvmgoscephcontrollerluminous][WARNIN] Error: Package: 2:ceph-common-12.2.12-0.el7.x86_64 (Ceph)
[mvmgoscephcontrollerluminous][WARNIN] Requires: liblz4.so.1()(64bit)
[mvmgoscephcontrollerluminous][WARNIN] Error: Package: 2:ceph-osd-12.2.12-0.el7.x86_64 (Ceph)
[mvmgoscephcontrollerluminous][WARNIN] Requires: liblz4.so.1()(64bit)
[mvmgoscephcontrollerluminous][WARNIN] Error: Package: 2:ceph-base-12.2.12-0.el7.x86_64 (Ceph)
[mvmgoscephcontrollerluminous][WARNIN] Requires: gperftools-libs >= 2.6.1
[mvmgoscephcontrollerluminous][WARNIN] Available: gperftools-libs-2.4-8.el7.i686 (base)
[mvmgoscephcontrollerluminous][DEBUG ] You could try using --skip-broken to work around the problem
[mvmgoscephcontrollerluminous][WARNIN] gperftools-libs = 2.4-8.el7
[mvmgoscephcontrollerluminous][WARNIN] Error: Package: 2:ceph-mon-12.2.12-0.el7.x86_64 (Ceph)
[mvmgoscephcontrollerluminous][WARNIN] Requires: liblz4.so.1()(64bit)
[mvmgoscephcontrollerluminous][WARNIN] Error: Package: 2:ceph-base-12.2.12-0.el7.x86_64 (Ceph)
[mvmgoscephcontrollerluminous][WARNIN] Requires: liblz4.so.1()(64bit)
[mvmgoscephcontrollerluminous][DEBUG ] You could try running: rpm -Va --nofiles --nodigest
[mvmgoscephcontrollerluminous][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y install ceph ceph-radosgw
Does anyone know about this issue, i.e. how I can resolve these package dependencies?
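My guess is that a repository providing liblz4 and a recent gperftools-libs
is missing on this node; would enabling EPEL before retrying be the right
fix, i.e.:

  sudo yum install -y epel-release
  sudo yum makecache
  ceph-deploy install --release luminous mvmgoscephcontrollerluminous

(the node name is taken from the output above)?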
Best Regards,
Ruchir Nerurkar
857-701-3405
I'm testing an upgrade to Nautilus on a development cluster and the
command "ceph device ls" is returning an empty list.
# ceph device ls
DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
#
I have walked through the luminous upgrade documentation under
https://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-o…
but I don't see anything pertaining to "activating" device support under
Nautilus.
The devices are visible to ceph-volume on the storage nodes, i.e.:
[osdev-stor1 ~]# ceph-volume lvm list

====== osd.0 =======

  [block]       /dev/ceph-f5eb16ec-7074-477b-8f83-ce87c5f74fa3/osd-block-c1de464f-d838-4558-ba75-1c268e538d6b

      block device  /dev/ceph-f5eb16ec-7074-477b-8f83-ce87c5f74fa3/osd-block-c1de464f-d838-4558-ba75-1c268e538d6b
      block uuid    dlbIm6-H5za-001b-C3mQ-EGks-yoed-zoQpoo
      <snip>
      devices       /dev/sdb

====== osd.2 =======

  [block]       /dev/ceph-37145a74-6b2b-4519-b72e-2defe11732aa/osd-block-e06c513b-5af3-4bf6-927f-1f0142c59e8a

      block device  /dev/ceph-37145a74-6b2b-4519-b72e-2defe11732aa/osd-block-e06c513b-5af3-4bf6-927f-1f0142c59e8a
      block uuid    egdvpm-3bXx-xmNO-ACzp-nxax-Wka2-81rfNT
      <snip>
      devices       /dev/sdc
Is there a step I missed?
Thanks.
Gary.
--
Gary Molenkamp Computer Science/Science Technology Services
Systems Administrator University of Western Ontario
molenkam(a)uwo.ca http://www.csd.uwo.ca
(519) 661-2111 x86882 (519) 661-3566
Another update:
We now took the more destructive route and removed the CephFS pools
(luckily we had only test data in the filesystem).
Our hope was that during the startup process the OSDs would delete the
no-longer-needed PGs, but this is NOT the case.
So we still have the same issue; the only difference is that the PG
does not belong to a pool anymore.
-360> 2019-08-07 14:52:32.655 7fb14db8de00 5 osd.44 pg_epoch: 196586
pg[23.f8s0(unlocked)] enter Initial
-360> 2019-08-07 14:52:32.659 7fb14db8de00 -1
/build/ceph-13.2.6/src/osd/ECUtil.h: In function
'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
7fb14db8de00 time 2019-08-07 14:52:32.660169
/build/ceph-13.2.6/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
stripe_size == 0)
We can now take one route and try to delete the PG by hand in the OSD
(BlueStore) - how can this be done? Or we try to upgrade to Nautilus and
hope for the best.
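If by hand is the way, I assume it would be something like the following
with the OSD stopped (PG ID taken from the log above; --force because the
PG is orphaned - please correct me if this is wrong):

  systemctl stop ceph-osd@44
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 \
      --pgid 23.f8s0 --op remove --force
  systemctl start ceph-osd@44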
Any help or hints are welcome.
Have a nice one,
Ansgar
On Wed, Aug 7, 2019 at 11:32 AM Ansgar Jazdzewski
<a.jazdzewski(a)googlemail.com> wrote:
>
> Hi,
>
> as a follow-up:
> * a full log of one OSD failing to start: https://pastebin.com/T8UQ2rZ6
> * our EC pool creation in the first place: https://pastebin.com/20cC06Jn
> * ceph osd dump and ceph osd erasure-code-profile get cephfs
> https://pastebin.com/TRLPaWcH
>
> As we dig more into it, it looks like a bug in the CephFS or
> erasure-coding part of Ceph.
>
> Ansgar
>
>
> On Tue, Aug 6, 2019 at 2:50 PM Ansgar Jazdzewski
> <a.jazdzewski(a)googlemail.com> wrote:
> >
> > hi folks,
> >
> > we had to move one of our clusters, so we had to reboot all servers; now
> > we found an error on all OSDs with the EC pool.
> >
> > Do we miss some options? Will an upgrade to 13.2.6 help?
> >
> >
> > Thanks,
> > Ansgar
> >
> > 2019-08-06 12:10:16.265 7fb337b83200 -1
> > /build/ceph-13.2.4/src/osd/ECUtil.h: In function
> > 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread
> > 7fb337b83200 time 2019-08-06 12:10:16.263025
> > /build/ceph-13.2.4/src/osd/ECUtil.h: 34: FAILED assert(stripe_width %
> > stripe_size == 0)
> >
> > ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic
> > (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x102) [0x7fb32eeb83c2]
> > 2: (()+0x2e5587) [0x7fb32eeb8587]
> > 3: (ECBackend::ECBackend(PGBackend::Listener*, coll_t const&,
> > boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore*,
> > CephContext*, std::shared_ptr<ceph::ErasureCodeInterface>, unsigned
> > long)+0x4de) [0xa4cbbe]
> > 4: (PGBackend::build_pg_backend(pg_pool_t const&,
> > std::map<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >, std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> >,
> > std::less<std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > >,
> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > > > > const&, PGBackend::Listener*, coll_t,
> > boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore*,
> > CephContext*)+0x2f9) [0x9474e9]
> > 5: (PrimaryLogPG::PrimaryLogPG(OSDService*, std::shared_ptr<OSDMap
> > const>, PGPool const&, std::map<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> >,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > >,
> > std::allocator<std::pair<std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const,
> > std::__cxx11::basic_string<char, std::char_traits<char>,
> > std::allocator<char> > > > > const&, spg_t)+0x138) [0x8f96e8]
> > 6: (OSD::_make_pg(std::shared_ptr<OSDMap const>, spg_t)+0x11d3)
> > [0x753553]
> > 7: (OSD::load_pgs()+0x4a9) [0x758339]
> > 8: (OSD::init()+0xcd3) [0x7619c3]
> > 9: (main()+0x3678) [0x64d6a8]
> > 10: (__libc_start_main()+0xf0) [0x7fb32ca68830]
> > 11: (_start()+0x29) [0x717389]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > needed to interpret this.
> I can add RAM. And is there a way to increase RocksDB caching - can I
> increase bluestore_cache_size_hdd to a higher value to cache RocksDB?
In recent releases it's governed by the osd_memory_target parameter. In
previous releases it's bluestore_cache_size_hdd. Check release notes to
know for sure.
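E.g. for recent releases, either in ceph.conf or via the config database
(6 GiB here is just an example value; the default is 4 GiB):

  [osd]
  osd memory target = 6442450944

or: ceph config set osd osd_memory_target 6442450944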
> We have planned to add some SSDs - how many OSDs' RocksDBs can we
> put per SSD? And I guess that if one SSD goes down then all related OSDs
> have to be re-installed.
Yes. At least you'd better not put all 24 block.db's on a single SSD :)
4-8 HDDs per SSD is usually fine. Also check db_used_bytes in `ceph
daemon osd.0 perf dump` (replace 0 with the actual OSD numbers) to figure
out how much space your DBs use. If it's below 30 GB you're lucky, because
in that case the DBs will fit on 30 GB SSD partitions.
https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
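A quick way to check this for all OSDs on a node (a sketch; assumes jq is
installed and the admin sockets are in the default location):

  for sock in /var/run/ceph/ceph-osd.*.asok; do
      echo -n "$sock: "
      ceph daemon "$sock" perf dump | jq '.bluefs.db_used_bytes'
  done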
--
Vitaliy Filippov
On Wed, Aug 7, 2019 at 9:30 AM Robert LeBlanc <robert(a)leblancnet.us> wrote:
>> # ceph osd crush rule dump replicated_racks_nvme
>> {
>> "rule_id": 0,
>> "rule_name": "replicated_racks_nvme",
>> "ruleset": 0,
>> "type": 1,
>> "min_size": 1,
>> "max_size": 10,
>> "steps": [
>> {
>> "op": "take",
>> "item": -44,
>> "item_name": "default~nvme" <------------
>> },
>> {
>> "op": "chooseleaf_firstn",
>> "num": 0,
>> "type": "rack"
>> },
>> {
>> "op": "emit"
>> }
>> ]
>> }
>
>
> Yes, our HDD cluster is much like this, but not Luminous, so we created a separate root with SSD OSDs for the metadata and set up a CRUSH rule for the metadata pool to be mapped to SSDs. I understand that the CRUSH rule should have a `step take default class ssd`, which I don't see in your rule unless the `~` in the item_name means device class.
~ is the internal implementation of device classes. Internally it's
still using separate roots, that's how it stays compatible with older
clients that don't know about device classes.
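For reference, such a class-aware rule is normally created with something
like the following (which is where a default~nvme shadow root comes from;
the pool name below is an assumption):

  ceph osd crush rule create-replicated replicated_racks_nvme default rack nvme
  ceph osd pool set cephfs_metadata crush_rule replicated_racks_nvme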
And since it wasn't mentioned here yet: consider upgrading to Nautilus
to benefit from the new and improved accounting for metadata space.
You'll be able to see how much space is used for metadata and quotas
should work properly for metadata usage.
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
>
> Thanks
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1