Hi list,
I'm in the middle of an OpenStack migration (obviously Ceph-backed) and
have stumbled upon some huge virtual machines.
To keep downtime to a minimum, I'm thinking of using Ceph's snapshot
features together with rbd export-diff and import-diff.
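The workflow I have in mind is roughly the following (just a sketch; pool, image
and snapshot names are placeholders, and the destination image would need to
exist already):
rbd snap create rbd/huge-vm@snap1
rbd export-diff rbd/huge-vm@snap1 - | ssh dest-host 'rbd import-diff - rbd/huge-vm'
# later, while the VM keeps running on the source, ship only the delta:
rbd snap create rbd/huge-vm@snap2
rbd export-diff --from-snap snap1 rbd/huge-vm@snap2 - | ssh dest-host 'rbd import-diff - rbd/huge-vm'
# a final pass would be done with the VM stopped, then switch over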
However, is it safe (or even supported) to do this across versions?
The source cluster is running 10.2.11 and the destination is 12.2.11.
Thanks in advance!
Regards,
Kees
--
https://nefos.nl/contact
Nefos IT bv
Ambachtsweg 25 (industrienummer 4217)
5627 BZ Eindhoven
Nederland
KvK 66494931
/Available on Monday, Tuesday, Wednesday and Friday/
Hello,
We run a Nautilus 14.2.8 Ceph cluster.
After a big crash in which we lost some disks, we had a PG down (erasure-coded
3+2 pool), and while trying to fix it we followed this guide:
https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-…
As the PG was reported with 0 objects, we first marked a shard as complete
with ceph-objectstore-tool and restarted the OSD.
The PG then went active but reported lost objects!
As we consider the data on this PG lost, we tried to get rid of them with
ceph pg 30.3 mark_unfound_lost delete.
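For reference, the sequence we ran was roughly the following (a sketch from
memory; the OSD id and data path are illustrative):
systemctl stop ceph-osd@103
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-103 --pgid 30.3s0 --op mark-complete
systemctl start ceph-osd@103
ceph pg 30.3 mark_unfound_lost delete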
This produced some logs like (~3 lines/hour):
2020-05-12 14:45:05.251830 osd.103 (osd.103) 886 : cluster [ERR] 30.3s0 Unexpected Error: recovery ending with 41: {
30:c000e27d:::rbd_data.34.c963b6314efb84.0000000000000100:head=435293'2 flags = delete,
30:c01f1248:::rbd_data.34.7f0c0d1df22f45.0000000000000325:head=435293'3 flags = delete,
30:c05e82b2:::rbd_data.34.674d063bdc66d2.0000000000000015:head=435293'4 flags = delete,
30:c0b2d8e7:::rbd_data.34.6bc88749c741cb.00000000000007d0:head=435293'5 flags = delete,
30:c0c3e20e:::rbd_data.34.674d063bdc66d2.00000000000000fb:head=435293'6 flags = delete,
30:c0c89740:::rbd_data.34.a7f2202210bb39.0000000000000bbc:head=435293'7 flags = delete,
30:c0e59ffa:::rbd_data.34.7f0c0d1df22f45.00000000000002fb:head=435293'8 flags = delete,
30:c0e72bf4:::rbd_data.34.7f0c0d1df22f45.00000000000000fa:head=435293'9 flags = delete,
30:c10ab507:::rbd_data.34.80695c646d9535.0000000000000327:head=435293'10 flags = delete,
30:c219e412:::rbd_data.34.a7f2202210bb39.0000000000000fa0:head=435293'11 flags = delete,
30:c29aeba3:::rbd_data.34.8038585a0eb9f6.0000000000000eb2:head=435293'12 flags = delete,
30:c29fae09:::rbd_data.34.674d063bdc66d2.000000000000148a:head=435293'13 flags = delete,
30:c2b77a99:::rbd_data.34.7f0c0d1df22f45.000000000000031d:head=435293'14 flags = delete,
30:c2c8598f:::rbd_data.34.674d063bdc66d2.00000000000002f5:head=435293'15 flags = delete,
30:c2dd39fe:::rbd_data.34.6494fb1b0f88bf.000000000000030b:head=435293'16 flags = delete,
30:c2f6ce39:::rbd_data.34.806ab864459ae5.0000000000000109:head=435293'17 flags = delete,
30:c2f8a62f:::rbd_data.34.ed0c58ebdc770f.000000000000002a:head=435293'18 flags = delete,
30:c306cd86:::rbd_data.34.ed0c58ebdc770f.0000000000000205:head=435293'19 flags = delete,
30:c30f5230:::rbd_data.34.7f0c0d1df22f45.00000000000002f5:head=435293'20 flags = delete,
30:c32b81df:::rbd_data.34.c79f6d1f78a707.0000000000000100:head=435293'21 flags = delete,
30:c3374080:::rbd_data.34.7f217e33dd742c.00000000000007d0:head=435293'22 flags = delete,
30:c3cdbeb5:::rbd_data.34.674dcefe97f606.0000000000000109:head=435293'23 flags = delete,
30:c3cdd149:::rbd_data.34.674dcefe97f606.0000000000000019:head=435293'24 flags = delete,
30:c40946c0:::rbd_data.34.ded8d21a9d3d8f.00000000000002a8:head=435293'25 flags = delete,
30:c42ed4fd:::rbd_data.34.a6985314ad8dad.0000000000000200:head=435293'26 flags = delete,
30:c483a99b:::rbd_data.34.ed0c58ebdc770f.0000000000000a00:head=435293'27 flags = delete,
30:c49f09d6:::rbd_data.34.7e1c1abf436885.0000000000000bb8:head=435293'28 flags = delete,
30:c515a4e8:::rbd_data.34.ed0c58ebdc770f.0000000000000106:head=435293'29 flags = delete,
30:c5181a8e:::rbd_data.34.9385d45172fa0f.000000000000020c:head=435293'30 flags = delete,
30:c531de44:::rbd_data.34.6bc88749c741cb.0000000000000102:head=435293'31 flags = delete,
30:c5427518:::rbd_data.34.806ab864459ae5.00000000000006db:head=435293'32 flags = delete,
30:c5693b53:::rbd_data.34.6494fb1b0f88bf.000000000000148a:head=435293'33 flags = delete,
30:c5804bc9:::rbd_data.34.ed0cb8730e020c.0000000000000105:head=435293'34 flags = delete,
30:c598117e:::rbd_data.34.7f0811fbac0b9d.0000000000000327:head=435293'35 flags = delete,
30:c5a64fbd:::rbd_data.34.c963b6314efb84.0000000000000010:head=435293'36 flags = delete,
30:c5f9e0e5:::rbd_data.34.ed0c58ebdc770f.0000000000000f01:head=435293'37 flags = delete,
30:c5ffe1d8:::rbd_data.34.6bc88749c741cb.0000000000000abe:head=435293'38 flags = delete,
30:c6ecfaa1:::rbd_data.34.9385d45172fa0f.0000000000000002:head=435293'39 flags = delete,
30:c755550f:::rbd_data.34.6494fb1b0f88bf.0000000000000106:head=435293'40 flags = delete,
30:c7a730f4:::rbd_data.34.7f217e33dd742c.00000000000006e1:head=435293'41 flags = delete,
30:c7aa79f7:::rbd_data.34.674dcefe97f606.0000000000000108:head=435293'42 flags = delete}
But yesterday it started to flood the logs (~9 GB of logs/day!) with
lines like:
2020-05-14 10:36:03.851258 osd.29 [ERR] Error -2 reading object
30:c24a0173:::rbd_data.34.806ab864459ae5.000000000000022d:head
2020-05-14 10:36:03.851333 osd.29 [ERR] Error -2 reading object
30:c4a41972:::rbd_data.34.6bc88749c741cb.0000000000000320:head
2020-05-14 10:36:03.851382 osd.29 [ERR] Error -2 reading object
30:c543da6f:::rbd_data.34.80695c646d9535.0000000000000dce:head
2020-05-14 10:36:03.859900 osd.29 [ERR] Error -2 reading object
30:c24a0173:::rbd_data.34.806ab864459ae5.000000000000022d:head
2020-05-14 10:36:03.859979 osd.29 [ERR] Error -2 reading object
30:c4a41972:::rbd_data.34.6bc88749c741cb.0000000000000320:head
We think the best option would probably be to completely delete this PG. Is
that possible without totally breaking the pool? How?
Do we need to recreate the PG manually, or will Ceph do it automatically?
Thanks for your help.
F.
Coincidentally Adam on our core team just reported this morning that he
saw extremely high bluestore_cache_other memory usage while running
compression performance tests as well. That may indicate we have a
memory leak related to the compression code. I doubt setting the
memory_target to 3GiB will help in the long run as that will just
attempt to compensate by decreasing the other caches until nothing else
can be shrunk. Adam said he's planning to investigate so hopefully we
will know more soon.
Mark
On 5/13/20 10:52 AM, Rafał Wądołowski wrote:
> Mark,
> Unfortunately I closed the terminal with the mempool output. But there were a
> lot of bytes used by bluestore_cache_other; that was the highest value (about
> 85%). The onode cache takes about 10%. PGlog and osdmaps were okay, low
> values. I saw some ideas that maybe compression_mode force on a pool can
> make a mess.
> One more thing: we are running the stupid allocator. Right now I am
> decreasing osd_memory_target to 3GiB and will wait to see if the RAM problem
> occurs again.
>
>
>
> Regards,
>
> */Rafał Wądołowski/*
>
> ------------------------------------------------------------------------
> *From:* Mark Nelson <mnelson(a)redhat.com>
> *Sent:* Wednesday, May 13, 2020 3:30 PM
> *To:* ceph-users(a)ceph.io <ceph-users(a)ceph.io>
> *Subject:* [ceph-users] Re: Memory usage of OSD
> On 5/13/20 12:43 AM, Rafał Wądołowski wrote:
> > Hi,
> > I noticed a strange situation in one of our clusters. The OSD daemons
> > are taking too much RAM.
> > We are running 12.2.12 and have the default configuration of
> > osd_memory_target (4GiB).
> > Heap dump shows:
> >
> > osd.2969 dumping heap profile now.
> > ------------------------------------------------
> > MALLOC: 6381526944 ( 6085.9 MiB) Bytes in use by application
> > MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
> > MALLOC: + 173373288 ( 165.3 MiB) Bytes in central cache freelist
> > MALLOC: + 17163520 ( 16.4 MiB) Bytes in transfer cache freelist
> > MALLOC: + 95339512 ( 90.9 MiB) Bytes in thread cache freelists
> > MALLOC: + 28995744 ( 27.7 MiB) Bytes in malloc metadata
> > MALLOC: ------------
> > MALLOC: = 6696399008 ( 6386.2 MiB) Actual memory used (physical +
> swap)
> > MALLOC: + 218267648 ( 208.2 MiB) Bytes released to OS (aka unmapped)
> > MALLOC: ------------
> > MALLOC: = 6914666656 ( 6594.3 MiB) Virtual address space used
> > MALLOC:
> > MALLOC: 408276 Spans in use
> > MALLOC: 75 Thread heaps in use
> > MALLOC: 8192 Tcmalloc page size
> > ------------------------------------------------
> > Call ReleaseFreeMemory() to release freelist memory to the OS (via
> madvise()).
> > Bytes released to the OS take up virtual address space but no
> physical memory.
> >
> > IMO "Bytes in use by application" should be less than
> osd_memory_target. Am I correct?
> > I checked heap dump with google-pprof and got following results.
> > Total: 149.4 MB
> > 60.5 40.5% 40.5% 60.5 40.5%
> rocksdb::UncompressBlockContentsForCompressionType
> > 34.2 22.9% 63.4% 34.2 22.9%
> ceph::buffer::create_aligned_in_mempool
> > 11.9 7.9% 71.3% 12.1 8.1%
> std::_Rb_tree::_M_emplace_hint_unique
> > 10.7 7.1% 78.5% 71.2 47.7% rocksdb::ReadBlockContents
> >
> > Does it mean that most of RAM is used by rocksdb?
>
>
> It looks like your heap dump is only accounting for 149.4MB of the
> memory so probably not representative across the whole ~6.5G. Instead
> could you try dumping the mempools via "ceph daemon osd.2969
> dump_mempools"?
>
>
> >
> > How can I take a deeper look into memory usage ?
>
>
> Beyond looking at the mempools, you can see the bluestore cache
> allocation information by either enabling debug bluestore and debug
> priority_cache_manager 5, or potentially looking at the PCM perf
> counters (I'm not sure if those were in 12.2.12 though). Between the
> heap data, mempool data, and priority cache records, it should become
> clearer what's going on.
>
>
> Mark
>
>
> >
> >
> > Regards,
> >
> > Rafał Wądołowski
> >
> >
> >
Hi everyone,
My Ceph version is 12.2.12. I want to set require-min-compat-client to
luminous, so I used the command:
#ceph osd set-require-min-compat-client luminous
but Ceph reported:
Error EPERM: cannot set require_min_compat_client to luminous: 4 connected
client(s) look like jewel (missing 0xa00000000200000); add
--yes-i-really-mean-it to do it anyway
[root@node-1 ~]# ceph features
{
    "mon": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 3
        }
    },
    "osd": {
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 15
        }
    },
    "client": {
        "group": {
            "features": "0x40106b84a842a52",
            "release": "jewel",
            "num": 4
        },
        "group": {
            "features": "0x3ffddff8eeacfffb",
            "release": "luminous",
            "num": 168
        }
    }
}
So I ran the command:
[root@node-1 gyt]# ceph osd set-require-min-compat-client luminous
--yes-i-really-mean-it
set require_min_compat_client to luminous
But now I want to set require-min-compat-client back to jewel, so I use the command:
[root@node-1 gyt]# ceph osd set-require-min-compat-client jewel
Error EPERM: osdmap current utilizes features that require luminous;
cannot set require_min_compat_client below that to jewel
What is the way to change it back from luminous to jewel?
Hi,
I deployed a multi-site setup in order to sync data from one cluster
to another. The data is fully synced (I suppose) and the cluster
has no traffic at present. Everything seems fine.
However, the sync status is not what I expected. Is there any step
needed after the data transfer? Can I change the master zone
to my new zone? Can I stop the old cluster?
sudo radosgw-admin sync status
          realm bde4bb56-fbca-4ef8-a979-935dbf109b78 (cn)
      zonegroup d25ae683-cdb8-4227-be45-ebaf0aed6050 (beijing)
           zone 313c8244-fe4d-4d46-bf9b-0e33e46be041 (newzone)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: f70a5eb9-d88d-42fd-ab4e-d300e97094de (oldzone)
                        syncing
                        full sync: 1/128 shards
                        full sync: 0 buckets to sync
                        incremental sync: 127/128 shards
                        data is behind on 14 shards
                        behind shards: [3,21,42,54,55,62,71,75,92,95,104,106,108,122]
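For context, if promoting the new zone ever becomes necessary, the failover
steps described in the multisite documentation look roughly like this (untested
on my side; the zone name is just ours):
radosgw-admin zone modify --rgw-zone=newzone --master --default
radosgw-admin period update --commit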
Hi,
I have created an erasure-coded pool, and the default parameters below,
related to stripe sizes, are present:
"osd_pool_erasure_code_stripe_width": "4096" --> 4KB
"rgw_obj_stripe_size": "4194304" --> 4MB
Let's say the k+m values are 10+5 for the erasure pool, and we upload one
object of size <4MB and another object of size >4MB. How will Ceph
break the objects into chunks and store them, and how many stripes will be
created?
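For reference, this is how I have been checking what the pool actually ends up
with (just the commands I believe are relevant; pool/profile names are placeholders):
ceph osd pool get <pool-name> erasure_code_profile
ceph osd erasure-code-profile get <profile-name>
ceph osd pool ls detail    # shows the computed stripe_width per pool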
Regards,
Biswajeet
Hi
On one of our Ceph clusters, some OSDs have been marked as full. Since this is a staging cluster that does not have much data on it, this is strange.
Looking at the full OSDs through “ceph osd df” I figured out that the space is mostly used by metadata:
SIZE: 122 GiB
USE: 118 GiB
DATA: 2.4 GiB
META: 116 GiB
We run mimic, and for the affected OSDs we use a db device (nvme) in addition to the primary device (hdd).
In the logs we see the following errors:
2020-05-12 17:10:26.089 7f183f604700 1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0; fallback to bdev 2
2020-05-12 17:10:27.113 7f183f604700 1 bluestore(/var/lib/ceph/osd/ceph-8) _balance_bluefs_freespace gifting 0x180a000000~400000 to bluefs
2020-05-12 17:10:27.153 7f183f604700 1 bluefs add_block_extent bdev 2 0x180a000000~400000
We assume it is an issue with RocksDB, as the following call will quickly fix the problem:
ceph daemon osd.8 compact
The question is: why is this happening? I would think that “compact” is something that runs automatically from time to time, but I’m not sure.
Is it on us to run this regularly?
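For now, what we do by hand when an OSD fills up is roughly the following (run
on the host that carries the OSD; the ids are just examples):
for id in 8 9 10; do
    ceph daemon osd.$id compact
done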
Any pointers are welcome. I’m quite new to Ceph :)
Cheers,
Denis