On Mon, Dec 2, 2019 at 3:42 PM Romit Misra <romit.misra(a)flipkart.com> wrote:
Hi Robert,
I am not quite sure I understand your question correctly, but what I gather
is that you want inbound writes to land on the cache tier, which presumably
would be on faster media, possibly an SSD. From there you would want the
data to trickle down to the base tier, which is an EC pool hosted on HDDs.
Some pointers I have:
It is better to have separate media for the base and cache tiers, HDD and
SSD respectively.
If the intent is never to promote to the cache tier on read, you could set
the read recency requirement (min_read_recency_for_promote) to a high
number such as 3 and, at the same time, make the bloom filter window small.
(This basically translates into: promote only if the object has been read X
times in the past Y seconds.)
Keep in mind that the larger the window, the larger the bloom filter, and
hence you would see an increase in OSD memory usage.
I have a patch lurking somewhere that disables promotes; let me check on
it, if this is for a specific case.
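For reference, the settings above are per-pool options set via the standard
Ceph CLI; the pool name "cachepool" below is a placeholder and the values
are illustrative, not recommendations:

```shell
# Require an object to appear in several recent hit sets before a read promotes it
ceph osd pool set cachepool min_read_recency_for_promote 3

# Keep the bloom-filter window small: few, short hit sets
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_count 3
ceph osd pool set cachepool hit_set_period 600   # seconds covered by each hit set
```

With 3 hit sets of 600 seconds each, an object would have to be read in
each of the last three 10-minute windows before a read promoted it.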
If your intent is to have a constant decay rate from the cache tier to the
base tier, here is what you could do:
1. Set the max objects on the cache tier to X.
2. Set the max size to Y; this would normally be 60-70 percent of the
total cache tier capacity.
3. Flushes start happening on the first trigger of either of the above
thresholds.
4. You could set the evict age to roughly double the time you expect the
data to take to hit the base tier.
5. Lastly, have you tried running COSBench or a related tool to qualify
the IOPS of your base tier with EC enabled? You may not require the cache
tier at all.
6. There are substantial overheads to cache tier maintenance, the major
one being the absence of throttles on how flushing happens.
7. A thundering herd of write requests can cause a huge amount of flushing
to the base tier.
8. IMHO it is suitable and predictable for loads where the number of
ingress requests can be predicted and there is some kind of rate limiting
on them.
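Steps 1-4 map onto the cache-tier pool settings; here is a sketch with a
placeholder pool name and illustrative values (the evict age shown is 60
days, i.e. double a 30-day time-to-base-tier):

```shell
# 1. Cap the number of objects in the cache tier (X)
ceph osd pool set cachepool target_max_objects 1000000

# 2. Cap the size at ~60-70% of the cache tier's capacity (Y), in bytes
ceph osd pool set cachepool target_max_bytes 700000000000

# 3. Ratios of the targets above at which flushing and eviction kick in
ceph osd pool set cachepool cache_target_dirty_ratio 0.4
ceph osd pool set cachepool cache_target_full_ratio 0.8

# 4. Evict age set to roughly double the expected time-to-base-tier
ceph osd pool set cachepool cache_min_evict_age 5184000   # 60 days in seconds
```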
A little more background is in order. We have a cluster that was set up
years ago in haste (Jewel), and now that I'm here, I'm trying to make it
better. We have CephFS running in 3x replication with ~1.5PB of data on it.
Given that most of the data is fairly cold, EC is a great candidate for it;
however, you can't change a CephFS pool after creation, otherwise I'd
overlay, tier the data down, remove the overlay, and then run on a straight
EC pool. This file system is under heavy and constant use from our compute
cluster, so trying to rsync it anywhere is going to be really painful. So
the thought was to do the next best thing and just use tiering. I
successfully configured tiering for an RBD cluster on Firefly back in the
day, so I'm pretty familiar with the pitfalls.
The idea is to use the existing HDD to house both the cache and base tiers.
Generally this is frowned upon, but given that we are targeting an evict
age of at _least_ 30 days I think we would be okay from a performance
perspective. The idea is to not promote anything for reads, and to try to
avoid it for writes as well. Data that is 30+ days old has an extremely
low likelihood of being written again, but it will most likely get read
every so often. The cache tiering in this case is more for space management
than for performance. New writes and reads of new data would come out of
the cache tier, but once a file has not been written to in 30 days, it
would be moved to the EC tier. Our data is generally 10% writes and 90%
reads. Our metadata is on SSD in the pool, but we don't have enough SSD to
hold our working dataset, or anything close to what I would feel
comfortable with.
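If we go ahead, the attach itself would be the usual tier sequence. This is
only a sketch under my assumptions: "ec_base" is a new EC pool (with a
pre-existing erasure profile "myprofile"), "cephfs_data" is our existing
replicated CephFS pool, which becomes the writeback cache on top of it:

```shell
# Create the new EC base pool (PG counts and profile are placeholders)
ceph osd pool create ec_base 1024 1024 erasure myprofile

# Layer the existing, non-empty replicated pool on top as a writeback cache
ceph osd tier add ec_base cephfs_data --force-nonempty
ceph osd tier cache-mode cephfs_data writeback
```

Since CephFS clients already address cephfs_data, they would keep writing
to what is now the cache tier, and cold objects would flush down to ec_base.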
I couldn't recall whether a bloom filter is even needed if we are not
interested in promoting things up. If a bloom filter is required, then it
sounds like a short window combined with a high hit_set_count is better for
keeping objects from being promoted than a long window.
I think it's an interesting use case, and we have time to test it and
remove the caching if it doesn't work.
Thank you for your insights and I'm interested to hear what you think given
the additional info.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1