Hi,
I had a problem with an application (Seafile) that uses a Ceph backend through librados.
The corresponding pools are defined with size=3 and each object copy is on a different
host.
The cluster health is OK: all the monitors see all the hosts.
Now, a network problem occurs between my RADOS client and a single host only.
Then, when my application/client tries to access an object located on the
unreachable host (which is primary for the corresponding PG),
it does not fail over to another copy/host (and my application crashes later because,
after a while and many requests, too many files are open on Linux).
Is this the normal behavior? My storage is resilient (great!) but access to it is not...
If I stop the OSDs on that host or change their affinity to zero, the problem goes away,
so it seems that librados just checks and trusts the osdmap.
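As a rough sketch of that workaround, assuming "affinity" here means the OSD primary affinity and that the OSDs are managed by systemd (the OSD ID 12 is only a placeholder, not taken from the cluster in question):

```shell
# Hypothetical OSD ID; repeat for each OSD on the unreachable host.
OSD_ID=12

# Option 1: stop the OSD daemon (assumes a systemd-managed deployment).
systemctl stop "ceph-osd@${OSD_ID}"

# Option 2: keep the OSD running but stop it from being chosen as primary.
ceph osd primary-affinity "osd.${OSD_ID}" 0
```

Either change ends up in a new osdmap, which would explain why the client then goes to a different OSD.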
A tcpdump shows that the client keeps trying to reach the same OSD, without ever timing out.
It can be easily reproduced by defining a netfilter rule on one host to drop packets
coming from the client.
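A minimal reproduction along those lines might look like the following. It assumes iptables is the netfilter front end in use; the client address is a placeholder, not a real value from this setup.

```shell
# Run as root on the storage host (not on the client).
# 192.0.2.10 is a placeholder for the RADOS client's IP address.
CLIENT_IP=192.0.2.10

# Silently drop everything arriving from the client.
iptables -I INPUT -s "${CLIENT_IP}" -j DROP

# Remove the rule again once the test is done.
iptables -D INPUT -s "${CLIENT_IP}" -j DROP
```

Because only traffic from the client is dropped, the monitors and the other OSDs still see the host, so the cluster stays healthy and the osdmap does not change.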
Note: I am still on Luminous (on both the client and cluster sides).
Thanks for reading.
D.