The main reason to use SSDs is typically to improve IOPS for small writes, but for that workload most (if not all) consumer SSDs we have tested perform badly in Ceph.
The reason for this is that Ceph requires SYNC writes, and since consumer SSDs (and now even some cheap datacenter ones) don't have capacitors for power-loss protection, they cannot safely use the volatile caches that give them (semi-fake) good performance on desktops.
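You can see the effect for yourself with the usual single-threaded sync-write
fio test (a rough sketch only - it writes directly to the device and destroys
data on it; /dev/sdX is a placeholder):

  # 4k sync writes at queue depth 1, roughly the pattern the DB/WAL sees
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --name=sync-write-test

Drives with real power-loss protection sustain far higher IOPS in this test
than consumer drives that have to flush their caches on every sync.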
If that sounds bad, you should be even more careful if you shop around until you find a cheap drive that performs well - because there have historically been consumer drives that lie and acknowledge a sync even when the data is still only in volatile memory rather than safely persisted :-)
Samsung PM883 is likely one of the cheapest drives that you can still fully trust - at least if your application is not highly write-intensive.
Now, having said that, we have had pretty good experience with a way to partly cheat around these limitations: since we have large servers with mixed HDDs, we also have 2-3 NVMe Samsung PM983 M.2 drives per server on PCIe cards for the DB/WAL. It seems to work remarkably well to do this for consumer SSDs too, i.e. let each 4TB el cheapo SATA SSD (we used Samsung 860) use a ~100GB DB/WAL partition on an NVMe drive. This gives very nice low latencies in RADOS benchmarks, although they are still ~50% higher than with proper enterprise SSDs.
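For reference, the OSDs are created in the obvious way with ceph-volume; a
rough sketch (device and LV names are just examples, one DB LV per OSD):

  # carve a ~100GB logical volume per OSD out of the NVMe for the DB/WAL
  vgcreate db-nvme0 /dev/nvme0n1
  lvcreate -L 100G -n db-sdb db-nvme0
  # data on the cheap SATA SSD, DB (and therefore WAL) on the NVMe LV
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db db-nvme0/db-sdb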
Caveats:
- Think about balancing IOPS. If you have 10 SSD OSDs sharing a single NVMe DB/WAL device, you will likely be limited by the NVMe.
- If the NVMe drive dies, all the corresponding OSDs die with it.
- This might work for read-intensive applications, but if you try it for write-intensive applications you will wear out the consumer SSDs (check their write endurance).
- With consumer SSDs you will still see latency/bandwidth fluctuate and periodic throttling.
In comparison, even the relatively cheap PM883 "just works" at constant high bandwidth close to the bus limit, and the latency is a constant low fraction of a millisecond in Ceph.
In summary, while somewhat possible, I don't think it's worth the hassle/risk/complex setup with consumer drives, but if I absolutely had to, I would at least avoid the cheapest QVO models - and if you don't put the DB/WAL on a better device, I predict you'll regret it once you start doing benchmarks in RADOS.
Cheers,
Erik
Hi,
Is there any way to log the x-amz-request-id along with the request in
the rgw logs? We're using beast and don't see an option in the
configuration documentation to add headers to the request lines. We
use centralized logging and would like to be able to search all layers
of the request path (edge, LBs, ceph, etc.) by x-amz-request-id.
Right now, all we see is this:
debug 2021-04-01T15:55:31.105+0000 7f54e599b700 1 beast:
0x7f5604c806b0: x.x.x.x - - [2021-04-01T15:55:31.105455+0000] "PUT
/path/object HTTP/1.1" 200 556 - "aws-sdk-go/1.36.15 (go1.15.3; linux;
amd64)" -
We've also tried this:
ceph config set global rgw_enable_ops_log true
ceph config set global rgw_ops_log_socket_path /tmp/testlog
After doing this, inside the rgw container, we can run
socat - UNIX-CONNECT:/tmp/testlog
and see the log entries we want being recorded, but there has to be a
better way to do this, where the logs are emitted the same way beast
emits the request logs above, so that we can handle them using
journald. If there's an alternative that would accomplish the same
thing, we're very open to suggestions.
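For what it's worth, the crude interim workaround we've been considering is
just forwarding the socket into journald ourselves (a sketch; the tag is
arbitrary):

  # stream ops-log entries from the unix socket and tag them for the journal
  socat -u UNIX-CONNECT:/tmp/testlog STDOUT | systemd-cat -t rgw-ops-log

But that feels like a hack compared to having it in the normal log stream.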
Thank you,
David
Nautilus cluster is not trimming old osdmaps
ceph 14.2.16
ceph report |grep "osdmap_.*_committed"
report 1175349142
"osdmap_first_committed": 285562,
"osdmap_last_committed": 304247,
We've set osd_map_cache_size = 20000,
but the first/last committed difference is slowly growing toward that value as well.
osdmap_first_committed is not changing, for some strange reason.
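For what it's worth, this is roughly how we keep an eye on the gap (a sketch,
assuming jq is installed and the fields sit at the top level of the report as
in the grep above):

  # how many osdmap epochs the mons are currently retaining
  report=$(ceph report 2>/dev/null)
  first=$(echo "$report" | jq .osdmap_first_committed)
  last=$(echo "$report" | jq .osdmap_last_committed)
  echo "retained osdmap epochs: $((last - first))"

That number just keeps creeping up instead of the mons trimming old epochs away.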
The cluster has been around and upgraded since either Firefly or Jewel.
I have seen a few others with this problem, but no solution to it.
Any suggestions?
Thanks Joe
Hello.
I had a multisite RGW (14.2.16 Nautilus) setup, and some of the
buckets couldn't finish bucket sync due to overfull buckets.
There were different needs, and the sync had been started for the purpose of a migration.
I made the secondary zone the master and removed the old master zone
from the zonegroup.
Now I still have sync errors, and sync error trim does not work.
radosgw-admin --id radosgw.srv1 sync error list | grep name | wc -l
32000
That's a lot of errors. Sync error trim does nothing.
When I run period update --commit, I see that the sync_status field has
a lot of records, as below.
radosgw-admin --id radosgw.srv1 period update --commit
{
"id": "e5d30f8f",
"epoch": 7,
"predecessor_uuid": "1d0b7132",
"sync_status": [
"1_1611733356.499643_1448979853.1",
"1_1611225916.734727_865381974.1",
"1_1611648125.876993_1659659292.1",
"1_1608194415.061001_737663090.1",
"1_1605880458.143435_1259922694.1",
"1_1611225999.087089_1887995199.1",
"1_1586035175.626619_488028.1",
"",
"",
"1_1611057887.910246_973493243.1",
"1_1612180963.822684_807349060.1",
"",
"",
"1_1612180818.328001_807344892.1",
"1_1611058156.662721_1887884194.1",
"1_1611057588.159455_1887883796.1",
"1_1611647015.874625_1129837262.1",
"1_1586035175.602419_753756.1",
"",
"1_1606215091.912960_988474411.1",
"",
"1_1600418137.932356_1027064325.1",
"1_1609926537.036681_832230841.1",
"",
"",
"1_1611057624.857485_1658280806.1",
"1_1600419671.553723_365405366.1",
"",
"1_1611057662.014628_859134308.1",
"1_1611057665.933662_843443436.1",
"1_1605879154.805811_700811071.1",
"1_1602509494.904964_696294030.1",
"",
"1_1611057618.891024_1150752303.1",
"1_1611440831.055432_1458827253.1",
"1_1611451128.857514_806931659.1",
"",
"1_1611057597.877068_1785564634.1",
"1_1611057860.565465_1785564826.1",
"1_1585821684.950844_61616.1",
"",
"",
"",
"1_1601647994.988107_511440126.1",
"",
"1_1608194424.578834_777512349.1",
"1_1605879126.845904_958578574.1",
"",
"1_1590061636.162223_183644368.1",
"1_1609834839.884870_1076396513.1",
"",
"1_1612430017.546386_612493167.1",
"1_1605879158.230856_1635059634.1",
"",
"1_1612420115.322098_1468865033.1",
"1_1611057731.182423_817020944.1",
"1_1611225026.887795_806142997.1",
"1_1612188490.428048_1152864210.1",
"1_1612187913.914410_861646554.1",
"1_1609393942.952120_574675578.1",
"1_1611733086.223927_861322773.1",
"1_1605880394.928467_759903023.1",
"1_1600418082.175862_556536400.1",
"1_1605879150.320951_1210709666.1"
],
"period_map": {
"id": "e5d30f8f",
"zonegroups": [
{
"id": "667afef",
"name": "xy",
"api_name": "xy",
"is_master": "true",
"endpoints": [
"http://dns:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "fe8ee939",
"zones": [
{
"id": "fe8ee939",
"name": "prod",
"endpoints": [
"http://dns:80"
],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 101,
"read_only": "false",
"tier_type": "",
"sync_from_all": "false",
"sync_from": [],
"redirect_zone": ""
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "234837df"
}
],
"short_zone_ids": [
{
"key": "fe8ee939",
"val": 2970845644
}
]
},
"master_zonegroup": "667afefc",
"master_zone": "fe8ee939",
"period_config": {
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
},
"realm_id": "234837df",
"realm_name": "rep",
"realm_epoch": 3
}
I need to clean these errors before re-adding the secondary zone to the zonegroup.
Do you have any opinions?
If I delete the old periods, what will happen?
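For context, before deleting anything I was planning to look at what is
actually stored, roughly like this (just a sketch):

  # list all periods known to the realm, and which one is current
  radosgw-admin --id radosgw.srv1 period list
  radosgw-admin --id radosgw.srv1 period get-current
  # inspect an old period before deciding whether it can go
  radosgw-admin --id radosgw.srv1 period get --period <old-period-id>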
Hi everyone,
I'm new to ceph, and I'm currently doing some tests with cephadm and a few virtual machines.
I've deployed Ceph with cephadm on 5 VMs, each one with a 10GB virtual disk attached, and everything is working perfectly (so 5 OSDs and 5 monitors in my cluster). However, when I shut down a node, wait a few minutes and bring it up again, I was expecting the services to come back automatically, but this is not happening... The monitor service does not restart. It looks like the monmap gets changed while a node is offline, causing the monitor service to refuse to restart. I need to remove the monitor daemon with `ceph orch daemon rm <name>` so that a new one is automatically deployed again on this node.
Is that the expected behaviour?
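In case it helps, this is roughly what I run to look at the state and recover
(a sketch; mon.<name> is the daemon on the affected node):

  # compare the monmap with what cephadm thinks is deployed
  ceph mon dump
  ceph orch ps --daemon-type mon
  # what I currently do so cephadm redeploys a monitor on that node
  ceph orch daemon rm mon.<name> --force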
Best regards,
Maël
Hi all,
upon upgrading to 16.2.2 via cephadm, the upgrade gets stuck on the
first mgr.
Looking into this via docker logs, I see that it is still loading modules
when it is apparently terminated and restarted, in a loop.
When pausing the upgrade, the mgr succeeds in starting with the new version;
however, when resuming the upgrade, it seems to try to upgrade it again
even though it already has the new version, leading to the exact same loop.
Is there some setting or workaround to increase the time before the mgr is
redeployed, or can this behavior be caused by something else?
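For reference, this is roughly what I'm doing to pause/resume and check
versions while poking at it (a sketch):

  # watch the upgrade and cephadm activity
  ceph orch upgrade status
  ceph -W cephadm
  # pause so the mgr can finish loading its modules, then try again
  ceph orch upgrade pause
  ceph orch upgrade resume
  # confirm which daemons are already on 16.2.2
  ceph versions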
Greetings,
Kai
Hi guys,
Did anyone ever figure out how to fix rctime? I had a directory that was
robocopied from a Windows host and contained files with modification times in
the future. Now the directory tree up to the root will not update rctime.
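For reference, this is roughly how I'm looking at it (a sketch; the mount
point and path are just examples):

  # rctime as reported on a directory of the CephFS mount
  getfattr -n ceph.dir.rctime /mnt/cephfs/some/dir
  # find the files robocopy left with mtimes in the future under that tree
  find /mnt/cephfs/some/dir -newermt now -printf '%T+ %p\n'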
Thanks,
David