On Mon, Dec 2, 2019 at 3:42 PM Romit Misra <romit.misra@flipkart.com> wrote:

Hi Robert,
I am not quite sure I understand your question correctly, but what I gather is that you want inbound writes to land on the cache tier, which presumably would be on faster media, possibly an SSD.

From there you would want the data to trickle down to the base tier, which is an EC pool hosted on HDD.

Some of the pointers I have:-
It is better to have separate media for the base and cache tiers, HDD and SSD respectively.

If the intent is never to promote to the cache tier on read, you could set the read recency requirement (min_read_recency_for_promote) to a high number such as 3 and, at the same time, make the bloom filter window small. (This basically translates to: promote only if the object has been read X times in the past Y seconds.)

Keep in mind that the larger the window, the larger the bloom filter, and hence you would see an increase in OSD memory usage.
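
To illustrate, the settings I am referring to would look something along these lines (the pool name "cachepool" and the numbers are only placeholders, to be tuned for your workload):

  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool hit_set_count 1                  # keep the number of hit sets (and hence memory) small
  ceph osd pool set cachepool hit_set_period 600               # y: seconds covered by each hit set
  ceph osd pool set cachepool min_read_recency_for_promote 3   # higher than hit_set_count, so reads never qualify for promotion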

I have a patch lurking somewhere which disables promotes; let me check on it if this is for a specific case.

If your intent is to have a constant decay rate from the cache tier to the base tier, here is what you could do:-

1. Set the max objects on the cache tier to X.
2. Set the max size to, say, Y; this would normally be 60-70 percent of the total cache tier capacity.
3. Flushes start happening when the first of the above thresholds is triggered.
4. You could set the evict age to roughly double the time you expect the data to take to reach the base tier (a rough sketch of these settings follows this list).
5. Lastly, have you tried running COSBench or a related tool to qualify the IOPS of your base tier with EC enabled? You may not require the cache tier at all.
6. There is substantial overhead in maintaining a cache tier, the major one being the absence of throttles on how flushing happens.
7. A thundering herd of write requests can cause a huge amount of flushing to the base tier.
8. IMHO it is suitable and predictable for workloads where the number of ingress requests can be predicted and there is some kind of rate limiting on them.
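
A rough sketch of the thresholds above, again with a placeholder pool name and purely illustrative values:

  ceph osd pool set cachepool target_max_objects 1000000        # X: object count target for the cache tier
  ceph osd pool set cachepool target_max_bytes 549755813888     # Y: e.g. 512GB, roughly 60-70% of the cache tier capacity
  ceph osd pool set cachepool cache_target_dirty_ratio 0.4      # start flushing dirty objects at 40% of the target
  ceph osd pool set cachepool cache_target_full_ratio 0.7       # start evicting clean objects at 70% of the target
  ceph osd pool set cachepool cache_min_evict_age 1200          # seconds; roughly double the time you expect data to take to reach the base tier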

A little more background is in order. We have a cluster that was set up years ago in haste (Jewel), and now that I'm here, I'm trying to make it better. We have CephFS running in 3x replication with ~1.5PB of data on it. Given that most of the data is fairly cold, EC is a great candidate for it; however, you can't change a CephFS pool after creation, otherwise I'd overlay, tier it down, remove the overlay, and then run on a straight EC pool. This file system is under heavy and constant use from our compute cluster, so trying to rsync it anywhere is going to be really painful. So the thought was to do the next best thing and just use tiering. I successfully configured tiering for an RBD cluster on Firefly back in the day, so I'm pretty familiar with the pitfalls.

The idea is to use the existing HDDs to house both the cache and base tiers. Generally this is frowned upon, but given that we are targeting an evict age of at _least_ 30 days, I think we would be okay from a performance perspective. The idea is to not promote anything at all for reads, and to try not to do it at all for writes. Data that is 30+ days old has an extremely low likelihood of being written again, but it will most likely get read every so often. The cache tiering in this case is more for space management than performance. New writes and reads of new data would come out of the cache tier, but once a file has not been written to in 30 days, it would be moved to the EC tier. Our workload is generally 10% writes and 90% reads. Our metadata pool is on SSD, but we don't have enough SSD to hold our working dataset, or anything close to what I would feel comfortable with.
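
To make that concrete, the knobs I have in mind for the cache pool look roughly like this (the pool name "cephfs_cache" is just a placeholder, and the values are first guesses rather than anything tested):

  ceph osd pool set cephfs_cache hit_set_type bloom
  ceph osd pool set cephfs_cache hit_set_count 1
  ceph osd pool set cephfs_cache hit_set_period 3600
  ceph osd pool set cephfs_cache min_read_recency_for_promote 2    # higher than hit_set_count, so reads get proxied to the EC tier instead of promoted
  ceph osd pool set cephfs_cache min_write_recency_for_promote 2   # discourage promotion on writes to already-flushed objects as well
  ceph osd pool set cephfs_cache cache_min_flush_age 2592000       # 30 days, in seconds
  ceph osd pool set cephfs_cache cache_min_evict_age 2592000       # 30 days, in seconds

That is on top of the target_max_objects / target_max_bytes limits you mention, since as I understand it the tiering agent doesn't flush or evict until one of those targets is set.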

I couldn't recall whether a bloom filter is even needed if we are not interested in promoting anything. If a bloom filter is required, then it sounds like a short window that effectively gets ignored is better than a long one combined with a high hit_set_count.

I think it's an interesting use case, and we have time to test it and remove the caching if it doesn't work.

Thank you for your insights and I'm interested to hear what you think given the additional info.

Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1