The crash happens in PG::activate, so it's unrelated to IO etc.
My first approach here would be to read the code and try to understand
why it crashes, i.e. what exact condition is being violated here.
It looks like something that can probably be fixed by fiddling around
with ceph-objectstore-tool (but you should try to understand what
exactly is happening before running random ceph-objectstore-tool
commands)
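
For example (with the affected OSD stopped; the data path and pgid below
are placeholders for your setup, not something verified against your
cluster), something along these lines lets you inspect the PG and take a
backup before deciding whether to change anything:

   # list the objects in the affected PG, including the hit_set object
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299 \
       --pgid 36.2047 --op list

   # export the whole PG as a backup before modifying anything
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299 \
       --pgid 36.2047 --op export --file /root/pg36.2047.export

   # only once you understand the state: remove a single object,
   # passing the JSON object spec printed by --op list
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-299 \
       '<object-json-from-op-list>' remove
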
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at
https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Thu, Feb 27, 2020 at 1:15 PM Lincoln Bryant <lincolnb(a)uchicago.edu> wrote:
>
> Thanks Paul.
>
> I was able to mark many of the unfound ones as lost, but I'm still stuck with one unfound object and an OSD assert at this point.
>
> I've tried setting many of the OSD options to pause all cluster I/O, backfilling, rebalancing, the tiering agent, etc. to try to avoid hitting the assert, but alas this one OSD is still crashing. The OSD in question does manage to log quite a bit before crashing.
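>
> (Roughly the sort of flags I mean here -- an approximation for reference, not necessarily the exact list I ran:)
>
>    ceph osd set pause
>    ceph osd set nobackfill
>    ceph osd set norebalance
>    ceph osd set norecover
>    ceph osd set noout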
>
> Is there any way for me to delete this or create a dummy object in RADOS that will let this OSD come up, I wonder?
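>
> (Concretely, something like the following is what I have in mind -- the pool name is a placeholder, the object name is just reconstructed from the %-encoded log line below, and I don't know whether a recreated empty object would actually satisfy the missing-object check:)
>
>    rados -p <cache-pool> --namespace=.ceph-internal put \
>      "hit_set_36.2047_archive_2020-02-25 19:32:07.171593_2020-02-25 21:27:36.268116" /dev/null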
>
> --Lincoln
>
> OBJECT_UNFOUND 1/793053192 objects unfound (0.000%)
> pg 36.1755 has 1 unfound objects
> PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs down
> pg 36.1153 is down+remapped, acting [299]
> pg 36.2047 is down+remapped, acting [242]
>
> -2> 2020-02-27 06:13:12.265 7f0824f1c700 0 0x55ed866481e0 36.2047 unexpected need for 36:e2040000:.ceph-internal::hit_set_36.2047_archive_2020-02-25 19%3a32%3a07.171593_2020-02-25 21%3a27%3a36.268116:head have 1363674'2866712 flags = none tried to add 1365222'2867906(1363674'2866712) flags = delete
>
>
> ________________________________
> From: Paul Emmerich <paul.emmerich(a)croit.io>
> Sent: Thursday, February 27, 2020 5:27 AM
> To: Lincoln Bryant <lincolnb(a)uchicago.edu>
> Cc: ceph-users(a)ceph.io <ceph-users(a)ceph.io>
> Subject: Re: [ceph-users] Cache tier OSDs crashing due to unfound hitset object 14.2.7
>
> I've also encountered this issue, but luckily without the crashing
> OSDs, so marking as lost resolved it for us.
>
> See https://tracker.ceph.com/issues/44286
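>
> In our case that was just the standard mark-lost path, something along
> these lines (pgid taken from your list_unfound output; whether to use
> delete or revert is your call):
>
>    ceph pg 36.321b mark_unfound_lost delete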
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
>
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Feb 27, 2020 at 6:02 AM Lincoln Bryant <lincolnb(a)uchicago.edu> wrote:
> >
> > Hello Ceph experts,
> >
> > In the last day or so, we had a few nodes randomly reboot, and now unfound objects are reported in Ceph health during cluster recovery.
> >
> > It appears that the object in question is a hit set object, which I now cannot mark lost because Ceph cannot probe the OSDs that keep crashing due to missing the hit set object.
> >
> > Pasted below is the crash message[1] for osd.299, and some of the unfound objects[2]. Lastly [3] shows a sample of the hit set objects that are lost.
> >
> > I would greatly appreciate any insight you may have on how to move forward. As of right now this cluster is inoperable due to 3 down PGs.
> >
> > Thanks,
> > Lincoln Bryant
> >
> >
> > [1]
> > -4> 2020-02-26 22:26:29.455 7ff52edaa700 0 0x559587fa91e0 36.321b unexpected need for 36:d84c0000:.ceph-internal::hit_set_36.321b_archive_2020-02-24 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 1352209'2834660 flags = none tried to add 1352209'2834660 flags = none
> > -3> 2020-02-26 22:26:29.455 7ff52edaa700 0 0x559587fa91e0 36.321b unexpected need for 36:d84c0000:.ceph-internal::hit_set_36.321b_archive_2020-02-24 21%3a15%3a16.792846_2020-02-24 21%3a15%3a32.457855:head have 1352209'2834660 flags = none tried to add 1359781'2835659 flags = delete
> > -2> 2020-02-26 22:26:29.456 7ff53adc2700 3 osd.299 1367392 handle_osd_map epochs [1367392,1367392], i have 1367392, src has [1349017,1367392]
> > -1> 2020-02-26 22:26:29.460 7ff52edaa700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/PG.h: In function 'void PG::MissingLoc::add_active_missing(const pg_missing_t&)' thread 7ff52edaa700 time 2020-02-26 22:26:29.457170
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/PG.h: 838: FAILED ceph_assert(i->second.need == j->second.need)
> >
> > ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55955fdafc0f]
> > 2: (()+0x4dddd7) [0x55955fdafdd7]
> > 3: (PG::MissingLoc::add_active_missing(pg_missing_set<false> const&)+0x1e0) [0x55955ffa0cb0]
> > 4: (PG::activate(ObjectStore::Transaction&, unsigned int, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, PastIntervals>, std::allocator<std::pair<pg_notify_t, PastIntervals> > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, PastIntervals>, std::allocator<std::pair<pg_notify_t, PastIntervals> > > > > >*, PG::RecoveryCtx*)+0x1916) [0x55955ff3f1e6]
> > 5: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x370) [0x55955ff62d20]
> > 6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xfb) [0x55955ffa8d5b]
> > 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x97) [0x55955ff88507]
> > 8: (PG::handle_activate_map(PG::RecoveryCtx*)+0x1a8) [0x55955ff75848]
> > 9: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x61d) [0x55955feb161d]
> > 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa6) [0x55955feb2d16]
> > 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55956011a481]
> > 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55955fea7bbf]
> > 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x559560448976]
> > 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55956044b490]
> > 15: (()+0x7e25) [0x7ff5669bae25]
> > 16: (clone()+0x6d) [0x7ff565a9a34d]
> >
> > 0> 2020-02-26 22:26:29.465 7ff52edaa700 -1 *** Caught signal (Aborted) **
> > in thread 7ff52edaa700 thread_name:tp_osd_tp
> >
> > ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
> > 1: (()+0xf5e0) [0x7ff5669c25e0]
> > 2: (gsignal()+0x37) [0x7ff5659d71f7]
> > 3: (abort()+0x148) [0x7ff5659d88e8]
> > 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55955fdafc5e]
> > 5: (()+0x4dddd7) [0x55955fdafdd7]
> > 6: (PG::MissingLoc::add_active_missing(pg_missing_set<false> const&)+0x1e0) [0x55955ffa0cb0]
> > 7: (PG::activate(ObjectStore::Transaction&, unsigned int, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, PastIntervals>, std::allocator<std::pair<pg_notify_t, PastIntervals> > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, PastIntervals>, std::allocator<std::pair<pg_notify_t, PastIntervals> > > > > >*, PG::RecoveryCtx*)+0x1916) [0x55955ff3f1e6]
> > 8: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x370) [0x55955ff62d20]
> > 9: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xfb) [0x55955ffa8d5b]
> > 10: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x97) [0x55955ff88507]
> > 11: (PG::handle_activate_map(PG::RecoveryCtx*)+0x1a8) [0x55955ff75848]
> > 12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x61d) [0x55955feb161d]
> > 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa6) [0x55955feb2d16]
> > 14: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x55956011a481]
> > 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55955fea7bbf]
> > 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x559560448976]
> > 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55956044b490]
> > 18: (()+0x7e25) [0x7ff5669bae25]
> > 19: (clone()+0x6d) [0x7ff565a9a34d]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >
> > --- logging levels ---
> > 0/ 5 none
> > 0/ 1 lockdep
> > 0/ 1 context
> > 1/ 1 crush
> > 1/ 5 mds
> > 1/ 5 mds_balancer
> > 1/ 5 mds_locker
> > 1/ 5 mds_log
> > 1/ 5 mds_log_expire
> > 1/ 5 mds_migrator
> > 0/ 1 buffer
> > 0/ 1 timer
> > 0/ 1 filer
> > 0/ 1 striper
> > 0/ 1 objecter
> > 0/ 5 rados
> > 0/ 5 rbd
> > 0/ 5 rbd_mirror
> > 0/ 5 rbd_replay
> > 0/ 5 journaler
> > 0/ 5 objectcacher
> > 0/ 5 client
> > 1/ 5 osd
> > 0/ 5 optracker
> > 0/ 5 objclass
> > 1/ 3 filestore
> > 1/ 3 journal
> > 0/ 0 ms
> > 1/ 5 mon
> > 0/10 monc
> > 1/ 5 paxos
> > 0/ 5 tp
> > 1/ 5 auth
> > 1/ 5 crypto
> > 1/ 1 finisher
> > 1/ 1 reserver
> > 1/ 5 heartbeatmap
> > 1/ 5 perfcounter
> > 1/ 5 rgw
> > 1/ 5 rgw_sync
> > 1/10 civetweb
> > 1/ 5 javaclient
> > 1/ 5 asok
> > 1/ 1 throttle
> > 0/ 0 refs
> > 1/ 5 xio
> > 1/ 5 compressor
> > 1/ 5 bluestore
> > 1/ 5 bluefs
> > 1/ 3 bdev
> > 1/ 5 kstore
> > 4/ 5 rocksdb
> > 4/ 5 leveldb
> > 4/ 5 memdb
> > 1/ 5 kinetic
> > 1/ 5 fuse
> > 1/ 5 mgr
> > 1/ 5 mgrc
> > 1/ 5 dpdk
> > 1/ 5 eventtrace
> > 1/ 5 prioritycache
> > -2/-2 (syslog threshold)
> > -1/-1 (stderr threshold)
> > max_recent 10000
> > max_new 1000
> > log_file /var/log/ceph/ceph-osd.299.log
> > --- end dump of recent events ---
> >
> >
> > [2]
> > [root@ceph-mon01 ~]# ceph pg 36.321b list_unfound
> > {
> > "num_missing": 1,
> > "num_unfound": 1,
> > "objects": [
> > {
> > "oid": {
> > "oid": "hit_set_36.321b_archive_2020-02-24
21:15:16.792846_2020-02-24 21:15:32.457855",
> > "key": "",
> > "snapid": -2,
> > "hash": 12827,
> > "max": 0,
> > "pool": 36,
> > "namespace": ".ceph-internal"
> > },
> > "need": "1352209'2834660",
> > "have": "0'0",
> > "flags": "none",
> > "locations": []
> > }
> > ],
> > "more": false
> > }
> > [root@ceph-mon01 ~]# ceph pg 36.324a list_unfound
> > {
> > "num_missing": 1,
> > "num_unfound": 1,
> > "objects": [
> > {
> > "oid": {
> > "oid": "hit_set_36.324a_archive_2020-02-25
12:40:58.130723_2020-02-25 12:46:25.260587",
> > "key": "",
> > "snapid": -2,
> > "hash": 12874,
> > "max": 0,
> > "pool": 36,
> > "namespace": ".ceph-internal"
> > },
> > "need": "1361100'2822063",
> > "have": "0'0",
> > "flags": "none",
> > "locations": []
> > }
> > ],
> > "more": false
> > }
> > [root@ceph-mon01 ~]# ceph pg 36.10dc list_unfound
> > {
> > "num_missing": 1,
> > "num_unfound": 1,
> > "objects": [
> > {
> > "oid": {
> > "oid": "hit_set_36.10dc_archive_2020-02-25
12:40:58.129048_2020-02-25 12:45:02.202268",
> > "key": "",
> > "snapid": -2,
> > "hash": 4316,
> > "max": 0,
> > "pool": 36,
> > "namespace": ".ceph-internal"
> > },
> > "need": "1361089'2838543",
> > "have": "0'0",
> > "flags": "none",
> > "locations": []
> > }
> > ],
> > "more": false
> > }
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io