Playing with multi-site zones for Ceph Object Gateway
ceph version: 17.2.5
my setup: 3-zone multi-site; 3-way full sync mode;
each zone has 3 machines -> RGW+MON+OSD
running load test: 3000 concurrent uploads of 1M objects
after about 3-4 minutes of load the RGW machines get stuck; in 2 zones out of 3 RGW is not responding (e.g. to curl $RGW:80)
an attempt to restart RGW ends with `Initialization timeout, failed to initialize`
here is a gdb backtrace showing where it hangs after restart:
(gdb) inf thr
Id Target Id Frame
* 1 Thread 0x7fa7d3abbcc0 (LWP 30791) "radosgw" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffc7f7a2438) at ../sysdeps/nptl/futex-internal.h:183
...
(gdb) bt
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffc7f7a2438) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffc7f7a2488, cond=0x7ffc7f7a2410) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=cond@entry=0x7ffc7f7a2410, mutex=0x7ffc7f7a2488) at pthread_cond_wait.c:647
#3 0x00007fa7d7097e42 in ceph::condition_variable_debug::wait (this=this@entry=0x7ffc7f7a2410, lock=...) at ../src/common/mutex_debug.h:148
#4 0x00007fa7d7953cba in ceph::condition_variable_debug::wait<librados::IoCtxImpl::operate(const object_t&, ObjectOperation*, ceph::real_time*, int)::<lambda()> > (pred=..., lock=..., this=0x7ffc7f7a2410) at ../src/librados/IoCtxImpl.cc:672
#5 librados::IoCtxImpl::operate (this=this@entry=0x558347c21010, oid=..., o=0x558347e12310, pmtime=<optimized out>, flags=<optimized out>) at ../src/librados/IoCtxImpl.cc:672
#6 0x00007fa7d792bd55 in librados::v14_2_0::IoCtx::operate (this=this@entry=0x558347e44760, oid="notify.0", o=o@entry=0x7ffc7f7a2690, flags=flags@entry=0) at ../src/librados/librados_cxx.cc:1536
#7 0x00007fa7d9490ad1 in rgw_rados_operate (dpp=<optimized out>, ioctx=..., oid="notify.0", op=op@entry=0x7ffc7f7a2690, y=..., flags=0) at ../src/rgw/rgw_tools.cc:277
#8 0x00007fa7d9627e0f in RGWSI_RADOS::Obj::operate (this=this@entry=0x558347e44710, dpp=<optimized out>, op=op@entry=0x7ffc7f7a2690, y=..., flags=flags@entry=0) at ../src/rgw/services/svc_rados.h:112
#9 0x00007fa7d96209a5 in RGWSI_Notify::init_watch (this=this@entry=0x558347c49530, dpp=<optimized out>, y=...) at ../src/rgw/services/svc_notify.cc:214
#10 0x00007fa7d962161b in RGWSI_Notify::do_start (this=0x558347c49530, y=..., dpp=<optimized out>) at ../src/rgw/services/svc_notify.cc:277
#11 0x00007fa7d8f17bcf in RGWServiceInstance::start (this=0x558347c49530, y=..., dpp=<optimized out>) at ../src/rgw/rgw_service.cc:331
#12 0x00007fa7d8f1a260 in RGWServices_Def::init (this=this@entry=0x558347de90a0, cct=<optimized out>, have_cache=<optimized out>, raw=raw@entry=false, run_sync=<optimized out>, y=..., dpp=<optimized out>) at /usr/include/c++/9/bits/unique_ptr.h:360
#13 0x00007fa7d8f1cc40 in RGWServices::do_init (this=this@entry=0x558347de90a0, _cct=<optimized out>, have_cache=<optimized out>, raw=raw@entry=false, run_sync=<optimized out>, y=..., dpp=<optimized out>) at ../src/rgw/rgw_service.cc:284
#14 0x00007fa7d92a7b1f in RGWServices::init (dpp=<optimized out>, y=..., run_sync=<optimized out>, have_cache=<optimized out>, cct=<optimized out>, this=0x558347de90a0) at ../src/rgw/rgw_service.h:153
#15 RGWRados::init_svc (this=this@entry=0x558347de8dc0, raw=raw@entry=false, dpp=<optimized out>) at ../src/rgw/rgw_rados.cc:1380
#16 0x00007fa7d930f241 in RGWRados::initialize (this=0x558347de8dc0, dpp=<optimized out>) at ../src/rgw/rgw_rados.cc:1400
#17 0x00007fa7d944f85f in RGWRados::initialize (dpp=<optimized out>, _cct=0x558347c6a320, this=<optimized out>) at ../src/rgw/rgw_rados.h:586
#18 StoreManager::init_storage_provider (dpp=<optimized out>, dpp@entry=0x7ffc7f7a2e90, cct=cct@entry=0x558347c6a320, svc="rados", use_gc_thread=use_gc_thread@entry=true, use_lc_thread=use_lc_thread@entry=true, quota_threads=quota_threads@entry=true, run_sync_thread=true, run_reshard_thread=true, use_cache=true,
use_gc=true) at ../src/rgw/rgw_sal.cc:55
#19 0x00007fa7d8e7367a in StoreManager::get_storage (use_gc=true, use_cache=true, run_reshard_thread=true, run_sync_thread=true, quota_threads=true, use_lc_thread=true, use_gc_thread=true, svc="rados", cct=0x558347c6a320, dpp=0x7ffc7f7a2e90) at /usr/include/c++/9/bits/basic_string.h:267
#20 radosgw_Main (argc=<optimized out>, argv=<optimized out>) at ../src/rgw/rgw_main.cc:372
#21 0x0000558347883f56 in main (argc=<optimized out>, argv=<optimized out>) at ../src/rgw/radosgw.cc:12
Any suggestions on what the problem might be, and how to reset RGW so it will be able to start normally?
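For reference, here is a sketch of the first checks (placeholder names like <zone> and osd.<id> must be filled in; commands are echoed as a dry run, drop the echo on a live cluster). The hang is inside a librados op on "notify.0" in the RGW control pool, which usually points at OSD-side ops not completing (e.g. inactive PGs) rather than at RGW itself:

```shell
# Placeholder names: <zone> and osd.<id> must be filled in.
ctl_pool="<zone>.rgw.control"                   # default control pool name
echo "ceph health detail"                       # inactive PGs / slow ops?
echo "ceph pg dump_stuck inactive"              # PGs that would block the op
echo "rados -p $ctl_pool ls"                    # notify.0..notify.7 live here
echo "ceph daemon osd.<id> dump_ops_in_flight"  # find where the op is stuck
```

If PGs turn out to be inactive or stuck, getting them healthy again usually lets RGW finish init_watch without any RGW-side reset.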
Hello List,
I made a mistake: I drained a host instead of putting it into Maintenance
Mode (for an OS reboot). :-/
After "Stop Drain" and restoring the original "crush reweight" values,
everything looks fine so far.
cluster:
health: HEALTH_OK
services:
[..]
osd: 79 osds: 78 up (since 3h), 78 in (since 6w); 166 remapped pgs
[..]
And some objects are reported as misplaced.
But I can't remove/resolve this annoying "deleting" status on that host's
OSDs.
[image: Auswahl_2023-05-03_14-17.png]
Does someone have a hint for me?
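In case a sketch helps: what I've found so far suggests cancelling the leftover removals that the drain queued (this assumes cephadm / "ceph orch"; the OSD ids below are placeholders, and the commands are echoed as a dry run):

```shell
for id in 11 12 13; do               # the OSDs shown as "deleting"
  echo "ceph orch osd rm stop $id"   # cancel the removal the drain queued
done
echo "ceph orch osd rm status"       # queue should be empty afterwards
```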
Thanks,
Christoph
Dear Cephers,
We are planning the dist upgrade from Octopus to Quincy in the next weeks.
The first step is the Linux version upgrade from Ubuntu 18.04 to Ubuntu 20.04 on some big OSD servers running this OS version.
We just had a look at "Upgrading non-cephadm clusters":
https://ceph.io/en/news/blog/2022/v17-2-0-quincy-released/
Is there any advice or suggestion before we start the procedure?
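For our own notes, the rough order from the linked instructions looks like this (a sketch, not a substitute for the release notes; commands are echoed as a dry run, one host at a time per step):

```shell
release="quincy"
echo "ceph osd set noout"                      # avoid rebalancing during restarts
echo "systemctl restart ceph-mon.target"       # 1. mons first, one host at a time
echo "systemctl restart ceph-mgr.target"       # 2. then mgrs
echo "systemctl restart ceph-osd.target"       # 3. then OSDs, host by host
echo "systemctl restart ceph-mds.target"       # 4. then MDS / RGW daemons
echo "ceph osd require-osd-release $release"   # 5. finalize once all OSDs run Quincy
echo "ceph osd unset noout"
```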
regards, I
--
================================================================
Ibán Cabrillo Bartolomé
Instituto de Física de Cantabria (IFCA-CSIC)
Santander, Spain
Tel: +34942200969/+34669930421
Responsible for advanced computing service (RSC)
=========================================================================================
=========================================================================================
All our suppliers must know and accept IFCA policy available at:
https://confluence.ifca.es/display/IC/Information+Security+Policy+for+Exter…
==========================================================================================
Hi all,
On one server with a cache tier on Samsung PM983 SSDs for an EC base
tier on HDDs, I find the cache tier stops flushing or evicting when the
cache tier is near full. With quite some gdb-debugging, I find the
problem may be with the throttling mechanism. When the write traffic is
high, the cache tier quickly fills its maximum request count and
throttles further requests. Then flush stops because copy-from requests
are throttled by the cache tier OSD. Ironically, the 256 requests
already accepted by the cache tier cannot proceed, either, because the
cache tier is full and cannot flush/evict.
While we may advise that the cache tier should not go full, this deadlock
situation is not entirely comprehensible to me, because a full cache
can usually flush/evict as long as the base tier has space.
I wonder whether there are specific reasons for this behavior.
My test environment is with version 15.2.17 but the code in 17.2.2
appears to handle this part of logic in the same way.
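Not an answer to the "why", but one mitigation that should narrow the window (untested here; the pool name is a placeholder, commands echoed as a dry run) is to make the tier flush/evict well before it is full, so the copy-from requests never collide with a full cache:

```shell
cache_pool="hot-cache"   # placeholder name of the cache tier pool
echo "ceph osd pool set $cache_pool cache_target_dirty_ratio 0.4"       # start flushing early
echo "ceph osd pool set $cache_pool cache_target_dirty_high_ratio 0.6"  # flush aggressively
echo "ceph osd pool set $cache_pool cache_target_full_ratio 0.8"        # evict before full
```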
Cheers,
lin
Hi,
When using rbd mirroring, the mirroring applies to individual images only,
not the whole pool? So we don't need a dedicated pool on the destination
site to be mirrored; the only requirement is that the mirrored pools must
have the same name.
In other words, we create two pools with the same name, one on the source
site and the other on the destination site, we create the mirror link (one-way
or two-way replication), then we choose which images to sync.
Both pools can be used simultaneously on both sites, it's the mirrored
images that cannot be used simultaneously, only promoted ones.
Is this correct?
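To make the question concrete, the setup I have in mind would be image-mode mirroring, something like the following (pool/image/site names are placeholders; commands echoed as a dry run):

```shell
pool="mirrored-pool"                        # same pool name on both sites
echo "rbd mirror pool enable $pool image"   # per-image, not whole-pool, mirroring
echo "rbd mirror pool peer bootstrap create --site-name site-a $pool"    # link the sites
echo "rbd mirror image enable $pool/vm-disk-1 snapshot"                  # pick images to sync
```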
Regards.
Hey guys and girls,
I'm working on a project to build storage for one of our departments,
and I want to ask you guys and girls for input on the high-level
overview part. It's a long one, I hope you read along and comment.
SUMMARY
I made a plan last year to build a 'storage solution' including ceph
and some windows VM's to expose the data over SMB to clients. A year
later I finally have the hardware, built a ceph cluster, and I'm doing
tests. Ceph itself runs great, but when I wanted to start exposing the
data using iscsi to our VMware farm, I ran into some issues. I know
the iscsi gateways will introduce some new performance bottlenecks,
but I'm seeing really slow performance, still working on that.
But then I ran into the warning on the iscsi gateway page: "The iSCSI
gateway is in maintenance as of November 2022. This means that it is
no longer in active development and will not be updated to add new
features.". Wait, what? Why!? What does this mean? Does this mean that
iSCSI is now 'feature complete' and will still be supported the next 5
years, or will it be deprecated in the future? I tried searching, but
couldn't find any info on the decision and the roadmap.
My goal is to build a future-proof setup, and using deprecated
components should not be part of that of course.
If the iscsi gateway will still be supported the next few years and I
can iron out the performance issues, I can still go on with my
original plan. If not, I have to go back to the drawing board. And
maybe you guys would advise me to take another route anyway.
GOALS
My goals/considerations are:
- we want >1PB of storage capacity for cheap (on a tight budget) for
research data. Most of it is 'store once, read sometimes'. <1% of the
data is 'hot'.
- focus is on capacity, but it would be nice to have > 200MB/s of
sequential write/read performance and not 'totally suck' on random
i/o. Yes, not very well quantified, but ah. Sequential writes are most
important.
- end users all run Windows computers (mostly VDI's) and a lot of
applications require SMB shares.
- security is a big thing, we want really tight ACL's, specific
monitoring agents, etc.
- our data is incredibly important to us, we still want the 3-2-1
backup rule. Primary storage solution, a second storage solution in a
different place, and some of the data that is not reproducible is also
written to tape. We also want to be protected from ransomware or user
errors (so no direct replication to the second storage).
- I like open source, reliability, no fork-lift upgrades, no vendor
lock-in, blah, well, I'm on the ceph list here, no need to convince
you guys ;)
- We're hiring a commercial company to do ceph maintenance and support
for when I'm on leave or leaving the company, but they won't support
clients, backup software, etc, so I want something as simple as
possible. We do have multiple Windows/VMware admins, but no other real
linux guru's.
THE INITIAL PLAN
Given these considerations, I ordered two identical clusters, each
consisting of 3 monitor nodes and 8 osd nodes, Each osd node has 2
ssd's and 10 capacity disks (EC 4:2 for the data), and each node is
connected using a 2x25Gbps bond. Ceph is running like a charm. Now I
just have to think about exposing the data to end users, and I've been
testing different setups.
My original plan was to expose for example 10x100TB rbd images using
iSCSI to our VMware farm, formatting the luns with VMFS6, and run for
example 2 Windows file servers per datastore on that with a single DFS
namespace to end users. Then backup the file servers using our
existing Veeam infrastructure to RGW running on the second cluster
with an immutable bucket. This way we would have easily defined
security boundaries: the clients can only reach the file servers, the
file servers only see their local VMDK's, ESX only sees the luns on
the iSCSI target, etc. When a file server would be compromised, it
would have no access to ceph. We have easy incremental backups,
immutability for ransomware protection, etc. And the best part is that
the ceph admin can worry about ceph, the vmware admin can focus on
ESX, VMFS and all the vmware stuff, and the Windows admins can focus
on the Windows boxes, Windows-specific ACLS and tools and Veeam
backups and stuff.
CURRENT SITUATION
I'm building out this plan now, but I'm running into issues with
iSCSI. Are any of you doing something similar? What is your iscsi
performance compared to direct rbd?
In regard to performance: If I take 2 test windows VM's, I put one on
an iSCSI datastore and another with direct rbd access using the
windows rbd driver, I create a share on those boxes and push data to
it, I see different results (of course). Copying some iso images over
SMB to the 'windows vm running direct rbd' I see around 800MB/s write,
and 200MB/s read, which is pretty okay. When I send data to the
'windows vm running on top of iscsi' it starts writing at around
350MB/s, but after like 10-20 seconds drops to 100MB/s and won't go
faster. Reads are anywhere from 40MB/s to 80MB/s, which is not really
acceptable.
Another really viable and performant scenario would be to have the
Windows file servers connect to rbd directly with the windows rbd
driver. It seems to work well, it's fast, and you don't have the
bottleneck that the iscsi gateway creates. But I see this driver is
still in beta. Is anyone using this in production? What are your
experiences? We would miss out on the separation of layers and thus
have less security, but at the same time, it really increases
efficiency and performance.
And if I use rbd, then vmware won't see the storage, and I cannot do
an image backup using veeam. I could of course do backups of the rbd
images, using tools like restic or backy to rgw running on the second
cluster with immutable buckets. What are your experiences? Is it easy
to do differential backups of lots of 50TB rbd images? Change rate is
usually like 0.005% per day or something ;)
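For what it's worth, the differential-backup flow I have in mind is snapshot-based export-diff (a sketch only; image and snapshot names are placeholders, commands echoed as a dry run):

```shell
img="rbd/fileserver1"                                                  # placeholder image
echo "rbd snap create $img@backup-new"                                 # today's snapshot
echo "rbd export-diff --from-snap backup-old $img@backup-new fs1.diff" # only changed extents
echo "rbd import-diff fs1.diff $img"                                   # replay on the second cluster
```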
By the way, we also thought about CephFS, but we have some complex
stuff going on with extended ACL's that I don't think will play nice
with CephFS, and I think it's a lot more complex to backup CephFS than
block images.
If you made it here, thank you for your time! I hope you can share
thoughts on my questions!
Angelo.
Hi folks,
With a multi-site environment, when I create a bucket-level sync policy with a symmetric flow between the master zone and another zone, "bucket sync status" immediately shows that the sync is now enabled in the master zone. But it takes a while for it to show that in the other zone. I tried "period pull" at the other zone and "period push" at the master zone. Neither seem to make a difference. Is there a way to speed up this process?
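For context, my understanding (which may be wrong) is that bucket-level sync policies travel with bucket metadata through metadata sync rather than with the period, which would explain why period pull/push makes no difference. On the lagging zone I'd look at something like (bucket name is a placeholder; commands echoed as a dry run):

```shell
bucket="my-bucket"                                        # placeholder bucket name
echo "radosgw-admin metadata sync status"                 # run on the lagging zone
echo "radosgw-admin metadata sync run"                    # nudge metadata sync along
echo "radosgw-admin bucket sync status --bucket=$bucket"  # re-check the policy state
```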
Thanks,
Yixin
In 17.2.6 is there a security requirement that pool names supporting a
ceph fs filesystem match the filesystem name.data for the data and
name.meta for the associated metadata pool? (multiple file systems are
enabled)
I have filesystems from older versions with the data pool name matching
the filesystem and appending _metadata for that,
and even older filesystems with the pool name as in 'library' and
'library_metadata' supporting a filesystem called 'libraryfs'
The pools all have the cephfs tag.
But using the documented:
ceph fs authorize libraryfs client.basicuser / rw
command allows the root user to mount and browse the library directory
tree, but fails with 'operation not permitted' even when just reading a file.
However, changing the client.basicuser osd auth to 'allow rw' instead of
'allow rw tag...' allows normal operations.
So:
[client.basicuser]
key = <key stuff>==
caps mds = "allow rw fsname=libraryfs"
caps mon = "allow r fsname=libraryfs"
caps osd = "allow rw"
works, but the same with
caps osd = "allow rw tag cephfs data=libraryfs"
leads to the 'operation not permitted' on read, or write or any actual
access.
It remains a puzzle. Help appreciated!
Were there upgrade instructions about that? Any help pointing me to them would be appreciated.
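One thing I plan to check (an educated guess, not verified on this cluster): the 'allow rw tag cephfs data=libraryfs' cap only matches pools whose cephfs application metadata actually carries a data=libraryfs key, and pools created on older releases may have the bare "cephfs" tag without it. The check/fix would look like (commands echoed as a dry run):

```shell
fs="libraryfs"
echo "ceph osd pool application get library"                  # does it show data=$fs?
echo "ceph osd pool application set library cephfs data $fs"  # add the missing key
```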
Thanks
Harry Coin
Rock Stable Systems