Playing with multi-site zones for the Ceph Object Gateway (RGW).
Ceph version: 17.2.5
My setup: 3-zone multi-site, 3-way full sync mode;
each zone has 3 machines, each running RGW+MON+OSD.
Running a load test: 3000 concurrent uploads of 1M object.
After about 3-4 minutes of load the RGW machines get stuck: in 2 zones out of 3, RGW is no
longer responding (e.g. to curl $RGW:80).
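For anyone who wants to repeat the "not responding" check without curl, here is a minimal Python sketch. It was not part of my actual test. The name `probe_rgw`, the 5-second default timeout and the result strings are illustrative choices, not anything from Ceph. It only separates "answers HTTP" from "accepts the connection but never answers" and from other socket errors such as a refused connection.

```python
import http.client
import socket
import sys


def probe_rgw(host, port=80, timeout=5.0):
    """Send GET / and classify the outcome.

    Returns 'ok:<http status>', 'timeout', or 'error:<exception name>'.
    (Hypothetical helper; roughly what `curl $RGW:80` with a timeout tells you.)
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return f"ok:{resp.status}"
    except socket.timeout:
        # Connect or response took longer than `timeout`: the hung case.
        return "timeout"
    except (OSError, http.client.HTTPException) as exc:
        # Refused connection, reset, malformed reply, etc.
        return f"error:{type(exc).__name__}"
    finally:
        conn.close()


if __name__ == "__main__":
    # Usage: python probe_rgw.py rgw-host-1 rgw-host-2 ...
    for rgw_host in sys.argv[1:]:
        print(rgw_host, probe_rgw(rgw_host))
```

A 'timeout' result here would match what I see on the stuck zones, while a refused connection would instead point at the radosgw process not listening at all.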
An attempt to restart RGW ends with `Initialization timeout, failed to initialize`.
Here is a gdb backtrace showing where it hangs after the restart:
(gdb) inf thr
Id Target Id Frame
* 1 Thread 0x7fa7d3abbcc0 (LWP 30791) "radosgw" futex_wait_cancelable
(private=<optimized out>, expected=0, futex_word=0x7ffc7f7a2438) at
../sysdeps/nptl/futex-internal.h:183
...
(gdb) bt
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x7ffc7f7a2438) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffc7f7a2488,
cond=0x7ffc7f7a2410) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=cond@entry=0x7ffc7f7a2410, mutex=0x7ffc7f7a2488) at
pthread_cond_wait.c:647
#3 0x00007fa7d7097e42 in ceph::condition_variable_debug::wait
(this=this@entry=0x7ffc7f7a2410, lock=...) at ../src/common/mutex_debug.h:148
#4 0x00007fa7d7953cba in
ceph::condition_variable_debug::wait<librados::IoCtxImpl::operate(const object_t&,
ObjectOperation*, ceph::real_time*, int)::<lambda()> > (pred=..., lock=...,
this=0x7ffc7f7a2410) at ../src/librados/IoCtxImpl.cc:672
#5 librados::IoCtxImpl::operate (this=this@entry=0x558347c21010, oid=...,
o=0x558347e12310, pmtime=<optimized out>, flags=<optimized out>) at
../src/librados/IoCtxImpl.cc:672
#6 0x00007fa7d792bd55 in librados::v14_2_0::IoCtx::operate
(this=this@entry=0x558347e44760, oid="notify.0", o=o@entry=0x7ffc7f7a2690,
flags=flags@entry=0) at ../src/librados/librados_cxx.cc:1536
#7 0x00007fa7d9490ad1 in rgw_rados_operate (dpp=<optimized out>, ioctx=...,
oid="notify.0", op=op@entry=0x7ffc7f7a2690, y=..., flags=0) at
../src/rgw/rgw_tools.cc:277
#8 0x00007fa7d9627e0f in RGWSI_RADOS::Obj::operate (this=this@entry=0x558347e44710,
dpp=<optimized out>, op=op@entry=0x7ffc7f7a2690, y=..., flags=flags@entry=0) at
../src/rgw/services/svc_rados.h:112
#9 0x00007fa7d96209a5 in RGWSI_Notify::init_watch (this=this@entry=0x558347c49530,
dpp=<optimized out>, y=...) at ../src/rgw/services/svc_notify.cc:214
#10 0x00007fa7d962161b in RGWSI_Notify::do_start (this=0x558347c49530, y=...,
dpp=<optimized out>) at ../src/rgw/services/svc_notify.cc:277
#11 0x00007fa7d8f17bcf in RGWServiceInstance::start (this=0x558347c49530, y=...,
dpp=<optimized out>) at ../src/rgw/rgw_service.cc:331
#12 0x00007fa7d8f1a260 in RGWServices_Def::init (this=this@entry=0x558347de90a0,
cct=<optimized out>, have_cache=<optimized out>, raw=raw@entry=false,
run_sync=<optimized out>, y=..., dpp=<optimized out>) at
/usr/include/c++/9/bits/unique_ptr.h:360
#13 0x00007fa7d8f1cc40 in RGWServices::do_init (this=this@entry=0x558347de90a0,
_cct=<optimized out>, have_cache=<optimized out>, raw=raw@entry=false,
run_sync=<optimized out>, y=..., dpp=<optimized out>) at
../src/rgw/rgw_service.cc:284
#14 0x00007fa7d92a7b1f in RGWServices::init (dpp=<optimized out>, y=...,
run_sync=<optimized out>, have_cache=<optimized out>, cct=<optimized
out>, this=0x558347de90a0) at ../src/rgw/rgw_service.h:153
#15 RGWRados::init_svc (this=this@entry=0x558347de8dc0, raw=raw@entry=false,
dpp=<optimized out>) at ../src/rgw/rgw_rados.cc:1380
#16 0x00007fa7d930f241 in RGWRados::initialize (this=0x558347de8dc0, dpp=<optimized
out>) at ../src/rgw/rgw_rados.cc:1400
#17 0x00007fa7d944f85f in RGWRados::initialize (dpp=<optimized out>,
_cct=0x558347c6a320, this=<optimized out>) at ../src/rgw/rgw_rados.h:586
#18 StoreManager::init_storage_provider (dpp=<optimized out>,
dpp@entry=0x7ffc7f7a2e90, cct=cct@entry=0x558347c6a320, svc="rados",
use_gc_thread=use_gc_thread@entry=true, use_lc_thread=use_lc_thread@entry=true,
quota_threads=quota_threads@entry=true, run_sync_thread=true, run_reshard_thread=true,
use_cache=true,
use_gc=true) at ../src/rgw/rgw_sal.cc:55
#19 0x00007fa7d8e7367a in StoreManager::get_storage (use_gc=true, use_cache=true,
run_reshard_thread=true, run_sync_thread=true, quota_threads=true, use_lc_thread=true,
use_gc_thread=true, svc="rados", cct=0x558347c6a320, dpp=0x7ffc7f7a2e90) at
/usr/include/c++/9/bits/basic_string.h:267
#20 radosgw_Main (argc=<optimized out>, argv=<optimized out>) at
../src/rgw/rgw_main.cc:372
#21 0x0000558347883f56 in main (argc=<optimized out>, argv=<optimized out>) at
../src/rgw/radosgw.cc:12
(gdb)
Any suggestions on what the problem could be, and how to reset RGW so that it can start
normally?