Well, it sounds like changing the pdcache setting may not be possible for SSDs, which is the first I've heard of this.

I actually just checked another system that I forgot was behind a 3108 controller with SSDs (not ceph, so I wasn't considering it).
It looks like I ran into the same issue when configuring it, as that VD's disk cache policy is set to "Default" rather than "Disabled".

I think the best option is to try to configure it for "FastPath", which sadly has next to no documentation beyond material extolling its [purported] benefits.

The per-virtual-disk configuration required for Fast Path is:
- write-through cache policy
- no read-ahead
- direct (non-cached) I/O
The documentation gives the command to set these (for adapter zero, logical disk zero):
megacli -LDSetProp WT Direct NORA L0 a0

I believe the storcli equivalents would be:
storcli /cx/vx set wrcache=wt
storcli /cx/vx set rdcache=nora
storcli /cx/vx set iopolicy=direct
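
If it helps, a rough shell sketch for applying all three across several SSD-backed VDs at once might look like this (the controller and VD numbers are placeholders for your layout; double-check against your storcli version first):

for vd in 0 1 2 3; do                       # placeholder VD numbers, adjust to your SSD VDs
    storcli /c0/v${vd} set wrcache=wt       # write-through
    storcli /c0/v${vd} set rdcache=nora     # no read-ahead
    storcli /c0/v${vd} set iopolicy=direct  # direct (non-cached) I/O
done

A storcli /c0/vall show all afterwards should confirm the new policies.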

At this point that feels like the best option given the current constraints.

I don't know why they don't publish those settings more readily, but at least we have the Google machine for that.

Hopefully that may eke out a bit more performance.

Reed

On Sep 3, 2020, at 5:31 AM, VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at> wrote:

In theory it should be possible to change the "Block SSD Write Disk Cache Change = Yes" setting:
  1. Run MegaSCU -adpsettings -write -f mfc.ini -a0
  2. Edit the mfc.ini file, setting "blockSSDWriteCacheChange" to 0 instead of 1.
  3. Run MegaSCU -adpsettings -read -f mfc.ini -a0
With MegaCLI, however, I get an error; it will not let me save the config file. No idea why…
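
For the edit step - assuming the key really appears in mfc.ini as something like "blockSSDWriteCacheChange = 1" (I have not verified the exact formatting) - a one-liner along these lines could flip it:

sed -i.bak 's/\(blockSSDWriteCacheChange[[:space:]]*=[[:space:]]*\)1/\10/' mfc.ini   # keeps a .bak copy of the original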
 
My Config:
 
VD16 Properties :
===============
Strip Size = 256 KB
Number of Blocks = 3749642240
VD has Emulated PD = Yes
Span Depth = 1
Number of Drives Per Span = 1
Write Cache(initial setting) = WriteThrough
Disk Cache Policy = Enabled
Encryption = None
Data Protection = Disabled
Active Operations = None
Exposed to OS = Yes
Creation Date = 25-08-2020
Creation Time = 12:05:41 PM
Emulation type = None
 
Version :
=======
Firmware Package Build = 23.28.0-0010
Firmware Version = 3.400.05-3175
Bios Version = 5.46.02.0_4.16.08.00_0x06060900
NVDATA Version = 2.1403.03-0128
Boot Block Version = 2.05.00.00-0010
Bootloader Version = 07.26.26.219
Driver Name = megaraid_sas
Driver Version = 07.703.05.00-rc1
 
Supported Adapter Operations :
============================
Rebuild Rate = Yes
CC Rate = Yes
BGI Rate  = Yes
Reconstruct Rate = Yes
Patrol Read Rate = Yes
Alarm Control = Yes
Cluster Support = No
BBU  = Yes
Spanning = Yes
Dedicated Hot Spare = Yes
Revertible Hot Spares = Yes
Foreign Config Import = Yes
Self Diagnostic = Yes
Allow Mixed Redundancy on Array = No
Global Hot Spares = Yes
Deny SCSI Passthrough = No
Deny SMP Passthrough = No
Deny STP Passthrough = No
Support more than 8 Phys = Yes
FW and Event Time in GMT = No
Support Enhanced Foreign Import = Yes
Support Enclosure Enumeration = Yes
Support Allowed Operations = Yes
Abort CC on Error = Yes
Support Multipath = Yes
Support Odd & Even Drive count in RAID1E = No
Support Security = No
Support Config Page Model = Yes
Support the OCE without adding drives = Yes
support EKM = No
Snapshot Enabled = No
Support PFK = Yes
Support PI = Yes
Support LDPI Type1 = No
Support LDPI Type2 = No
Support LDPI Type3 = No
Support Ld BBM Info = No
Support Shield State = Yes
Block SSD Write Disk Cache Change = Yes -> this is not good, as it prevents changing the SSD cache! Stupid!
Support Suspend Resume BG ops = Yes
Support Emergency Spares = Yes
Support Set Link Speed = Yes
Support Boot Time PFK Change = No
Support JBOD = Yes
Disable Online PFK Change = No
Support Perf Tuning = Yes
Support SSD PatrolRead = Yes
Real Time Scheduler = Yes
Support Reset Now = Yes
Support Emulated Drives = Yes
Headless Mode = Yes
Dedicated HotSpares Limited = No
Point In Time Progress = Yes
 
Supported VD Operations :
=======================
Read Policy = Yes
Write Policy = Yes
IO Policy = Yes
Access Policy = Yes
Disk Cache Policy = Yes (but only for HDDs in this case)
Reconstruction = Yes
Deny Locate = No
Deny CC = No
Allow Ctrl Encryption = No
Enable LDBBM = No
Support FastPath = Yes
Performance Metrics = Yes
Power Savings = No
Support Powersave Max With Cache = No
Support Breakmirror = No
Support SSC WriteBack = No
Support SSC Association = No
 
From: Reed Dier <reed.dier@focusvq.com>
Sent: Wednesday, September 2, 2020 19:34
To: VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)
 
Just for the sake of curiosity, if you do a show all on /cX/vX, what is shown for the VD properties?
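
For reference, the exact invocation I have in mind is along these lines (the controller/VD numbers are placeholders):

storcli /cX/vX show all

Here's one of mine for comparison: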
VD0 Properties :
==============
Strip Size = 256 KB
Number of Blocks = 1953374208
VD has Emulated PD = No
Span Depth = 1
Number of Drives Per Span = 1
Write Cache(initial setting) = WriteBack
Disk Cache Policy = Disabled
Encryption = None
Data Protection = Disabled
Active Operations = None
Exposed to OS = Yes
Creation Date = 17-06-2016
Creation Time = 02:49:02 PM
Emulation type = default
Cachebypass size = Cachebypass-64k
Cachebypass Mode = Cachebypass Intelligent
Is LD Ready for OS Requests = Yes
SCSI NAA Id = 600304801bb4c0001ef6ca5ea0fcb283
 
I'm wondering if the pdcache value must be set at VD creation, since it is a creation option as well.
If that's the case, maybe consider blowing away one of the SSD VDs, recreating the VD and OSD, and seeing whether you can measure a difference on that disk specifically in testing.
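
As a rough sketch - the controller number, enclosure:slot, and strip size below are placeholders/assumptions, so check "storcli /c0 add vd help" on your firmware first - recreating a single-SSD VD with the on-disk cache disabled at creation time could look like:

storcli /c0 add vd type=raid0 drives=252:4 pdcache=off wt nora direct strip=256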
 
It might also be helpful to document some of these values from /cX show all
 
Version :
=======
Firmware Package Build = 24.7.0-0026
Firmware Version = 4.270.00-3972
Bios Version = 6.22.03.0_4.16.08.00_0x060B0200
Ctrl-R Version = 5.08-0006
Preboot CLI Version = 01.07-05:#%0000
NVDATA Version = 3.1411.00-0009
Boot Block Version = 3.06.00.00-0001
Driver Name = megaraid_sas
Driver Version = 07.703.05.00-rc1
 
Supported Adapter Operations :
============================
Support Shield State = Yes
Block SSD Write Disk Cache Change = Yes
Support Suspend Resume BG ops = Yes
Support Emergency Spares = Yes
Support Set Link Speed = Yes
Support Boot Time PFK Change = No
Support JBOD = Yes
 
Supported VD Operations :
=======================
Read Policy = Yes
Write Policy = Yes
IO Policy = Yes
Access Policy = Yes
Disk Cache Policy = Yes
Reconstruction = Yes
Deny Locate = No
Deny CC = No
Allow Ctrl Encryption = No
Enable LDBBM = No
Support FastPath = Yes
Performance Metrics = Yes
Power Savings = No
Support Powersave Max With Cache = No
Support Breakmirror = Yes
Support SSC WriteBack = No
Support SSC Association = No
Support VD Hide = Yes
Support VD Cachebypass = Yes
Support VD discardCacheDuringLDDelete = Yes
 
 
Advanced Software Option :
========================
 
----------------------------------------
Adv S/W Opt        Time Remaining  Mode
----------------------------------------
MegaRAID FastPath  Unlimited       -
MegaRAID RAID6     Unlimited       -
MegaRAID RAID5     Unlimited       -
----------------------------------------
 
 
Namely, on my 3108 controller, "Block SSD Write Disk Cache Change = Yes" stands out to me.
My controller has SAS HDDs behind it, though, so I may just not be running into the same issue; it may not pertain to me.
I'm also wondering whether FastPath is enabled. I know that on some of the older controllers it was a paid feature, but they later opened it up for free, though you may need a software key to enable it (at no cost).
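
To double-check on your side, something like this should show whether the FastPath key is active (the controller number is a placeholder):

storcli /c0 show all | grep -i fastpath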
 
Just looking to widen the net and hope we catch something.
 
Reed


On Sep 2, 2020, at 7:38 AM, VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at> wrote:
 
I assume you are referencing this parameter?


storcli /c0/v0 set ssdcaching=<on|off>


If so, this is for CacheCade, which is LSI's cache tiering solution, which should both be off and not in use for ceph.

No, storcli /cx/vx set pdcache=off is denied because of the LSI setting "Block SSD Write Disk Cache Change = Yes".
I cannot find any firmware to upload or any other way to change this.

Do you think that disabling the write cache on the SSDs themselves would help a lot? (Ceph is not aware of this, because smartctl -g wcache /dev/sdX shows the cache as disabled - because the cache on the LSI is already disabled.)
The only way would be to buy some HBA cards and add them to the servers. But that's a lot of work without knowing whether it will improve the speed much.
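
For reference, a quick loop to see what each SSD itself reports (the device names are placeholders):

for dev in /dev/sd[b-q]; do                          # placeholder device range
    echo -n "$dev: "
    smartctl -g wcache "$dev" | grep -i 'write cache'
done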

I am using RBD with hyperconverged nodes (4 at the moment); pools are 2x and 3x replicated. The performance for Windows and Linux VMs on the HDD OSD pool was actually OK, but it has been getting a little slower over time. I just want to get ready for the future: we plan to put some bigger database servers on the cluster (they are on local storage at the moment), and therefore I want to increase the cluster's small random IOPS a lot.

-----Original Message-----
From: Reed Dier <reed.dier@focusvq.com>
Sent: Tuesday, September 1, 2020 23:44
To: VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)


there is an option set in the controller, "Block SSD Write Disk Cache Change = Yes", which does not permit deactivating the SSD cache. I could not find any solution on Google for this controller (LSI MegaRAID SAS 9271-8i) to change this setting.


I assume you are referencing this parameter?

storcli /c0/v0 set ssdcaching=<on|off>

If so, this is for CacheCade, which is LSI's cache tiering solution, which should both be off and not in use for ceph.

Single thread and single iodepth benchmarks will tend to be underwhelming.
Ceph shines with aggregate performance from lots of clients.
And in an odd twist of fate, I typically see better performance on RBD for random benchmarks than for sequential benchmarks, as it distributes the load across more OSDs.

It might also help others offer some pointers for tuning if you describe the pool/application a bit more.

I.e., RBD vs CephFS vs RGW, 3x replicated vs EC, etc.

At least things are trending in a positive direction.

Reed


On Sep 1, 2020, at 4:21 PM, VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at> wrote:

Thank you. I was working in this direction. The situation is a lot better. But I think I can get still far better.

I could set the controller to write-through, direct, and no read-ahead for the SSDs.
But I cannot disable the pdcache: there is an option set in the controller, "Block SSD Write Disk Cache Change = Yes", which does not permit deactivating the SSD cache. I could not find any solution on Google for this controller (LSI MegaRAID SAS 9271-8i) to change this setting.

I don't know how much performance gain deactivating the SSD cache will bring. At least the Micron 5200 MAX has capacitors, so I hope it is safe from data loss in case of power failure. I sent a request to LSI / Broadcom asking whether they know how I can change this setting. This is really annoying.

I will check the CPU power settings. I also read somewhere that it can improve IOPS a lot (if it is set badly).

At the moment I get 600 IOPS for 4K random writes with 1 thread and an iodepth of 1. I get 40K 4K-random IOPS for some instances with an iodepth of 32. It's not the world, but a lot better than before. Reads are around 100K IOPS. That's with 16 SSDs and 2x dual 10G NICs.
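
For reference, this is the kind of fio job I mean (the pool and image names are placeholders, and it assumes fio was built with the rbd engine):

fio --name=ssd-4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=ssdpool --rbdname=benchimage \
    --rw=randwrite --bs=4k --numjobs=1 --iodepth=32 \
    --runtime=60 --time_based --group_reporting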

I was reading that good tuning and hardware config can get more than 2000 IOPS single-threaded out of the SSDs. I know that Ceph does not shine with a single thread, but 600 IOPS is not very much...

philipp

-----Original Message-----
From: Reed Dier <reed.dier@focusvq.com>
Sent: Tuesday, September 1, 2020 22:37
To: VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

If using storcli/perccli for manipulating the LSI controller, you can disable the on-disk write cache with:
storcli /cx/vx set pdcache=off

You can also ensure that you turn off write caching at the controller level with:
storcli /cx/vx set iopolicy=direct
storcli /cx/vx set wrcache=wt

You can also tweak the read-ahead value for the VD if you want, though with an SSD I don't think it will be much of an issue.
storcli /cx/vx set rdcache=nora

I'm sure the megacli alternatives are available with some quick searches.

You may also want to check your C-states and P-states to make sure there aren't any aggressive power-saving features getting in the way.
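
A quick way to eyeball that - the sysfs paths assume the usual cpufreq and intel_idle drivers, so adjust for your platform:

grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # current governor per core
cat /sys/module/intel_idle/parameters/max_cstate               # deepest C-state allowed by intel_idle

Setting the governor to "performance" is the usual first experiment.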

Reed


On Aug 31, 2020, at 7:44 AM, VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at> wrote:

We have older LSI RAID controllers with no HBA/JBOD option, so we expose the single disks as RAID0 devices. Ceph should not be aware of the cache status, right?
But digging deeper into it, it seems that 1 out of 4 servers is performing a lot better and has super low commit/apply latencies, while the others have a lot more (20+) on heavy writes. This only applies to the SSDs; for the HDDs I can't see a difference...
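
For reference, the per-OSD commit/apply numbers can be watched live with, for example:

ceph osd perf    # prints commit_latency(ms) and apply_latency(ms) per OSD

This can be run from any node with an admin keyring.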

-----Original Message-----
From: Frank Schilder <frans@dtu.dk>
Sent: Monday, August 31, 2020 13:19
To: VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at>; 'ceph-users@ceph.io' <ceph-users@ceph.io>
Subject: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

Yes, they can - if the volatile write cache is not disabled. There are many threads on this, including recent ones. Search for "disable write cache" and/or "disable volatile write cache".

You will also find different methods of doing this automatically.
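
For example (the device names are placeholders; which command applies depends on whether the drive is SATA or SAS):

hdparm -W 0 /dev/sdX           # SATA: turn off the drive's volatile write cache
sdparm --clear WCE /dev/sdY    # SAS/SCSI: clear the Write Cache Enable bit (add --save to persist)

A udev rule or a small systemd unit that runs these at boot is a common way to make it automatic.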

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: VELARTIS Philipp Dürhammer <p.duerhammer@velartis.at>
Sent: 31 August 2020 13:02:45
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

I have a production cluster with 60 OSDs and no extra journals. It is performing okay. Now I have added an extra SSD pool with 16 Micron 5100 MAX drives, and its performance is slightly slower than or equal to the 60-HDD pool, for 4K random as well as sequential reads. All of this is on a dedicated 2x 10G network. The HDDs are still on FileStore, the SSDs on BlueStore, Ceph Luminous.
What should be possible with 16 SSDs vs. 60 HDDs, with no extra journals?

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io