Rados Crashing - ceph-users

21 Oct 2020

We are performing file maintenance( deletes essentially ) and when the
process gets to a certain point, all four rados gateways crash with the
following:

Log output:

-5> 2020-10-20 06:09:53.996 7f15f1543700  2 req 7 0.000s s3:delete_obj
verifying op params

    -4> 2020-10-20 06:09:53.996 7f15f1543700  2 req 7 0.000s s3:delete_obj
pre-executing

    -3> 2020-10-20 06:09:53.996 7f15f1543700  2 req 7 0.000s s3:delete_obj
executing

    -2> 2020-10-20 06:09:53.997 7f161758f700 10 monclient: get_auth_request
con 0x55d2c02ff800 auth_method 0

    -1> 2020-10-20 06:09:54.009 7f1609d74700  5 process_single_shard():
failed to acquire lock on obj_delete_at_hint.0000000079

     0> 2020-10-20 06:09:54.035 7f15f1543700 -1 *** Caught signal
(Segmentation fault) **

in thread 7f15f1543700 thread_name:civetweb-worker

ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus
(stable)

1: (()+0xf5d0) [0x7f161d3405d0]

2: (()+0x2bec80) [0x55d2bcd1fc80]

3: (std::string::assign(std::string const&)+0x2e) [0x55d2bcd2870e]

4: (rgw_bucket::operator=(rgw_bucket const&)+0x11) [0x55d2bce3e551]

5: (RGWObjManifest::obj_iterator::update_location()+0x184) [0x55d2bced7114]

6: (RGWObjManifest::obj_iterator::operator++()+0x263) [0x55d2bd092793]

7: (RGWRados::update_gc_chain(rgw_obj&, RGWObjManifest&,
cls_rgw_obj_chain*)+0x51a) [0x55d2bd0939ea]

8: (RGWRados::Object::complete_atomic_modification()+0x83) [0x55d2bd093c63]

9: (RGWRados::Object::Delete::delete_obj()+0x74d) [0x55d2bd0a87ad]

10: (RGWDeleteObj::execute()+0x915) [0x55d2bd04b6d5]

11: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*,
req_state*, bool)+0x915) [0x55d2bcdfbb35]

12: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&,
rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*,
optional_yield, rgw::dmclock::Scheduler*, int*)+0x1cd8) [0x55d2bcdfdea8]

13: (RGWCivetWebFrontend::process(mg_connection*)+0x38e) [0x55d2bcd41a1e]

14: (()+0x36bace) [0x55d2bcdccace]

15: (()+0x36d76f) [0x55d2bcdce76f]

16: (()+0x36dc18) [0x55d2bcdcec18]

17: (()+0x7dd5) [0x7f161d338dd5]

18: (clone()+0x6d) [0x7f161c84302d]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

My guess is that we need to add more resources to the gateways?  They have 2
CPUs and 12GB of memory running as virtual machines on centOS 7.6 .  Any
thoughts?

-Brent