Hello.
I have tried to follow the documented writeback cache tier removal
procedure
(
https://docs.ceph.com/docs/master/rados/operations/cache-tiering/#removing-…)
on a test cluster, and failed.
I have successfully executed this command:
ceph osd tier cache-mode alex-test-rbd-cache proxy
Next, I am supposed to run this:
rados -p alex-test-rbd-cache ls
rados -p alex-test-rbd-cache cache-flush-evict-all
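To check whether the pool has actually drained, I count the remaining
objects with a one-liner like this (the pool name is from my setup):

```shell
# Count the objects still present in the cache pool.
POOL=alex-test-rbd-cache
remaining() {
  rados -p "$POOL" ls | wc -l | tr -d ' '
}
# Only meaningful against a live cluster:
if command -v rados >/dev/null 2>&1; then remaining; fi
```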
The failure mode is that, while client I/O is still going on, I
cannot get the cache pool down to zero objects, even with the help of
"rados -p alex-test-rbd-cache cache-flush-evict-all". And yes, I have
waited more than 20 minutes (my cache tier has hit_set_count 10 and
hit_set_period 120).
I also tried setting both cache_target_dirty_ratio and
cache_target_full_ratio to 0, but it didn't help.
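For completeness, this is how I zeroed the ratios (standard
"ceph osd pool set" syntax, pool name from my setup):

```shell
# Force aggressive flushing/eviction by zeroing the target ratios.
POOL=alex-test-rbd-cache
set_ratios() {
  ceph osd pool set "$POOL" cache_target_dirty_ratio 0
  ceph osd pool set "$POOL" cache_target_full_ratio 0
}
# Only meaningful against a live cluster:
if command -v ceph >/dev/null 2>&1; then set_ratios; fi
```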
Here is the relevant part of the pool setup:
# ceph osd pool ls detail
pool 25 'alex-test-rbd-metadata' replicated size 3 min_size 2
crush_rule 9 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 10973111 lfor 0/10971347/10971345 flags
hashpspool,nodelete stripe_width 0 application rbd
pool 26 'alex-test-rbd-data' erasure size 6 min_size 5 crush_rule 12
object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn
last_change 10973112 lfor 10971705/10971705/10971705 flags
hashpspool,ec_overwrites,nodelete,selfmanaged_snaps tiers 27 read_tier
27 write_tier 27 stripe_width 16384 application rbd
removed_snaps [1~3]
pool 27 'alex-test-rbd-cache' replicated size 3 min_size 2 crush_rule
9 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
last_change 10973113 lfor 10971705/10971705/10971705 flags
hashpspool,incomplete_clones,nodelete,selfmanaged_snaps tier_of 26
cache_mode proxy target_bytes 10000000000 hit_set
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 120s
x10 decay_rate 0 search_last_n 0 stripe_width 0 application rbd
removed_snaps [1~3]
The relevant crush rules are selecting ssds for the
alex-test-rbd-cache and alex-test-rbd-metadata pools (plain old
"replicated size 3" pools), and hdds for alex-test-rbd-data (which is
EC 4+2).
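For reference, rules of this shape are typically created along these
lines; the rule and profile names below are placeholders, not my
actual rule names:

```shell
# Hypothetical re-creation of the device-class split (names are placeholders).
make_rules() {
  # Replicated SSD rule for the cache and metadata pools:
  ceph osd crush rule create-replicated replicated-ssd default host ssd
  # EC 4+2 profile on HDDs for the data pool, and a rule from it:
  ceph osd erasure-code-profile set ec42-hdd k=4 m=2 crush-device-class=hdd
  ceph osd crush rule create-erasure ec42-hdd ec42-hdd
}
# Only meaningful against a live cluster:
if command -v ceph >/dev/null 2>&1; then make_rules; fi
```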
The client workload, which seemingly outpaces the eviction and flushing, is:
for a in `seq 1000 2000`; do
  time rbd import --data-pool alex-test-rbd-data \
    ./Fedora-Cloud-Base-32-1.6.x86_64.raw \
    alex-test-rbd-metadata/Fedora-copy-$a
done
The ceph version is "ceph version 14.2.9
(2afdc1f644870fb6315f25a777f9e4126dacc32d) nautilus (stable)" on all
osds.
The relevant part of "ceph df" is:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 23 TiB 20 TiB 2.9 TiB 3.0 TiB 12.99
ssd 1.7 TiB 1.7 TiB 19 GiB 23 GiB 1.28
TOTAL 25 TiB 22 TiB 2.9 TiB 3.0 TiB 12.17
POOLS:
    POOL                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    <irrelevant pools omitted>
    alex-test-rbd-metadata    25     237 KiB     2.37k       59 MiB      0         564 GiB
    alex-test-rbd-data        26     691 GiB     198.57k     1.0 TiB     6.52      9.7 TiB
    alex-test-rbd-cache       27     5.1 GiB     2.99k       15 GiB      0.90      564 GiB
The total size and the number of stored objects in the
alex-test-rbd-cache pool oscillate around 5 GB and 3K, respectively,
while "rados -p alex-test-rbd-cache cache-flush-evict-all" is running
in a loop. Without it, the size grows to 6 GB and stays there.
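The loop itself is nothing special; roughly the following (bounded
here so the sketch terminates, whereas on the real cluster I just keep
re-running it):

```shell
# Repeatedly flush/evict and report how many objects are left.
POOL=alex-test-rbd-cache
drain_cache() {
  for pass in 1 2 3 4 5; do
    rados -p "$POOL" cache-flush-evict-all
    left=$(rados -p "$POOL" ls | wc -l | tr -d ' ')
    echo "pass $pass: $left objects left"
    if [ "$left" -eq 0 ]; then break; fi
  done
}
# Only meaningful against a live cluster:
if command -v rados >/dev/null 2>&1; then drain_cache; fi
```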
# ceph -s
cluster:
id: <omitted for privacy>
health: HEALTH_WARN
1 cache pools at or near target size
services:
mon: 3 daemons, quorum xx-4a,xx-3a,xx-2a (age 10d)
mgr: xx-3a(active, since 5w), standbys: xx-2b, xx-2a, xx-4a
mds: cephfs:1 {0=xx-4b=up:active} 2 up:standby
osd: 89 osds: 89 up (since 7d), 89 in (since 7d)
rgw: 3 daemons active (xx-2b, xx-3b, xx-4b)
tcmu-runner: 6 daemons active (<only irrelevant images here>)
data:
pools: 15 pools, 1976 pgs
objects: 6.64M objects, 1.3 TiB
usage: 3.1 TiB used, 22 TiB / 25 TiB avail
pgs: 1976 active+clean
io:
client: 290 KiB/s rd, 251 MiB/s wr, 366 op/s rd, 278 op/s wr
cache: 123 MiB/s flush, 72 MiB/s evict, 31 op/s promote, 3 PGs
flushing, 1 PGs evicting
Is there any workaround, short of somehow telling the client to stop
creating new rbds?
--
Alexander E. Patrakov
CV:
http://pc.cd/PLz7