I'm getting similar errors after rebooting a node. The cluster was
upgraded from 15.2.1 to 15.2.2 yesterday, with no problems after the
reboots done during the upgrade.
On the node I just rebooted, 2/4 OSDs won't restart. Similar logs from
both. Logs from one below.
Neither OSD has compression enabled, although there is a
compression-related error in the log.
Both back replicated x3 pools. One has its data on an HDD with a
separate WAL/DB on an NVMe partition; the other has everything on a
single NVMe partition.
Feeling kinda nervous here - advice welcomed!
Thx, Chris
2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 3423870535 in db/000304.sst offset
18446744073709551615 size 18446744073709551615
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
[db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312
succeeded,manifest_file_number is 312, next_file_number is 314,
last_sequence is 22320582, log_number is 309,prev_log_number is
0,max_column_family is 0,min_log_number_to_keep is 0
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
[db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1
{"time_micros": 1589976840843199, "job": 1, "event": "recovery_started",
"log_files": [313]}
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
[db/db_impl_open.cc:583] Recovering log #313 mode 0
2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
[db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes;
Corruption: error in middle of record
2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
[db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes;
Corruption: missing start of fragmented record(2)
2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
[db/db_impl.cc:390] Shutdown: canceling all background work
2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
[db/db_impl.cc:563] Shutdown complete
2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error
in middle of record
2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900
shutdown
2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700
/var/lib/ceph/osd/ceph-9/block) close
2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000
/var/lib/ceph/osd/ceph-9/block) close
2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to
mount object store
2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR: osd
init failed: (5) Input/output error
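For what it's worth, the next steps I'm considering (just a sketch based
on my reading of the ceph-bluestore-tool docs - happy to be told these
are the wrong tools here) are an offline consistency check with the OSD
stopped, and a repair if fsck reports fixable errors:
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-9
    ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-9
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-9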
Thanks, Anthony.
On a 5-node cluster, replica 5, 5 MONs, failure domain host:
> > any two nodes might fail without leaving the cluster
> > unresponsive?
>
> No. Assuming again that your failure domain is “host”, some PGs will have two copies on these two nodes, so they will be undersized until redundancy is restored.
>
Ouch, right.
But aren't PGs checksummed, so that from the remaining copies (given
the checksums are correct) two new copies could be created?
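(To make sure we're talking about the same thing, this is how I'm
checking what the pool is set to tolerate - the pool name 'rbd' here is
just an example from my side:
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
My understanding is that PGs keep serving IO as long as at least
min_size copies are up, so the interplay of size 5 and min_size is what
I'm really asking about.)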
Hi all,
stumbling over a new Ceph cluster setup, I have a basic question
regarding the behaviour of Ceph.
The cluster I found runs 4 hardware nodes as hyperconverged instances:
3 nodes run a MON plus several OSD instances, while one node runs only
several OSDs.
At the same time, all nodes serve as KVM hosts, running several VMs.
Replication is set to three.
1.) Am I right that in this setup only one node may fail without
leaving the cluster unresponsive (i.e. rejecting further IO)?
If two nodes fail, the cluster will stop serving IO (but probably
won't lose data if at least one of the two failed nodes can recover).
2.) Am I further right that this setup with four nodes might provide
slightly better IO to the VMs than if it ran on only three nodes
(all of which run MON and OSD and KVM), but
3.) in a setup with 5 nodes, the IO performance delivered to the VMs
would be even better, and if all 5 nodes ran MON, OSDs and KVM,
any two nodes might fail without leaving the cluster unresponsive?
(Sure, one would have to move/restart the VMs of the failing nodes
... this is just from the underlying data-integrity point of view.)
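(In case concrete commands help to frame an answer - with 3 MONs I
assume quorum needs 2 of them, which I've been checking with
    ceph quorum_status --format json-pretty
    ceph -s
so question 1 is really about losing one vs. two of the three MON
nodes.)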
Sorry if this sounds dumb - it seems sort of obvious to me, but I have
to nail this down crystal clear.
A short, comprehensive answer would be most appreciated!
Best regards!
I've just set up a Ceph cluster and I'm accessing it via object gateway with S3 API.
One thing I don't see documented anywhere is - how does Ceph performance scale with S3 key prefixes?
In AWS S3, performance scales with the number of key prefixes (see: https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html). I see the keys as a nested hash table or as nodes of a prefix tree, where each prefix is stored in closer proximity at a hardware level - you want to spread reads evenly over prefixes to avoid concentrating parallel I/O on the same hot spots.
So for example if my access pattern regularly involves scanning data through multiple dates for a single city, this key structure will be more effective: `yyyymmdd/city/data.csv`. Whereas if my access pattern involves scanning through different cities on a single date, `city/yyyymmdd/data.csv` would be more effective.
How about Ceph? Does naming convention of the key prefixes have an effect on Ceph's object gateway performance or does it treat the full object "paths" as a completely flat namespace?
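(To make the question concrete: my understanding is that RADOS places an object by hashing its full name onto a PG, which would make prefixes irrelevant for data placement. I've been sanity-checking that with made-up keys against what I assume is the data pool, e.g.:
    ceph osd map default.rgw.buckets.data 20200501/london/data.csv
    ceph osd map default.rgw.buckets.data 20200502/london/data.csv
expecting two neighbouring keys to land on unrelated PGs.)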
Hi,
I'm planning to install a new Ceph cluster (Nautilus) using 8+3 EC on
SATA-only storage. We want to store only big files here (from 40-50MB
up to 200-300GB each). The write load will be higher than the read load.
I was thinking of the following Bluestore config, to reduce the load
on the drives as much as possible while keeping good storage
efficiency:
- stripe_unit for the Jerasure EC profile = 256K
- bluestore_min_alloc_size_hdd = bluefs_alloc_size = 1M
- RADOS object size = 4M
- rgw_max_chunk_size = rgw_obj_stripe_size = 4M
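For concreteness, I'd express that roughly like this (the profile name
is a placeholder, and I understand bluestore_min_alloc_size_hdd only
takes effect for newly created OSDs):
    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host stripe_unit=256K
    ceph config set osd bluestore_min_alloc_size_hdd 1048576
    ceph config set osd bluefs_alloc_size 1048576
    ceph config set client.rgw rgw_max_chunk_size 4194304
    ceph config set client.rgw rgw_obj_stripe_size 4194304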
What do you think? Is there any gain in using a higher
rgw_max_chunk_size and/or rgw_obj_stripe_size (for example 20M instead
of 4M)? What about bluestore_min_alloc_size and bluefs_alloc_size -
should I lower them to 512K or 256K?
Thanks,
Adrian
Hi,
I have a health warning about a full pool:
health: HEALTH_WARN
1 pool(s) full
This is the pool that is complaining:
Ceph df:
NAME ID USED %USED MAX AVAIL OBJECTS
k8s 8 200GiB 0.22 90.5TiB 51860
Rados df:
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
k8s-infraops 200GiB 51860 0 155580 0 0 0 1256337 126GiB 19012858 944GiB
What needs to be cleaned up to clear the pool-full warning?
Ceph version luminous 12.2.8.
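In case it's relevant: I understand a per-pool quota can also trigger
this warning, so I'm checking it like this (pool name as in the output
above):
    ceph osd pool get-quota k8s
    ceph df detail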
Thank you
Hi Sebastian,
I did not get your reply via e-mail. I am very sorry for this. I hope you can see this message...
I've re-run the upgrade and attached the log.
Thanks,
Gencer.
Hello everyone :)
When I try to create an OSD, Proxmox UI asks for
* Data disk
* DB disk
* WAL disk
Which disk will be the limiting factor in terms of storage size for my
OSD - the data disk?
How large do I need to make the other two?
Is there a risk of them filling up before the capacity of the
OSD is reached? If so, how do I avoid this (in terms of usage pattern)?
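(For reference, my understanding is that under the hood the UI does
roughly this - device paths here are made up:
    ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2
so my sizing questions above are about the two NVMe partitions.)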
Have a nice day :)
Hi list,
Looking into the diskprediction_local module, I see that it only
predicts three coarse states - good, warning and bad:
ceph/src/pybind/mgr/diskprediction_local/predictor.py:
    if score > 10:
        return "Bad"
    if score > 4:
        return "Warning"
    return "Good"
The predicted failure date is then just derived from that state:
ceph/src/pybind/mgr/diskprediction_local/module.py
    if result.lower() == 'good':
        life_expectancy_day_min = (TIME_WEEK * 6) + TIME_DAYS
        life_expectancy_day_max = None
    elif result.lower() == 'warning':
        life_expectancy_day_min = (TIME_WEEK * 2)
        life_expectancy_day_max = (TIME_WEEK * 6)
    elif result.lower() == 'bad':
        life_expectancy_day_min = 0
        life_expectancy_day_max = (TIME_WEEK * 2) - TIME_DAYS
Is it possible to get a finer-grained prediction date?
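(For comparison, the raw inputs and the module's own prediction can be
pulled like this - the device id is a placeholder:
    ceph device ls
    ceph device get-health-metrics <devid>
    ceph device predict-life-expectancy <devid>
but as far as I can tell the last one just reflects the same coarse
buckets.)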
--
Vytenis
Hi,
I have a large omap object in one of my clusters, running Luminous 12.2.8.
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'default.rgw.log'
Search the cluster log for 'Large omap object found' for more details.
In my setup the haproxy load balancer runs on 3 OSD nodes, which are also the monitor and manager nodes.
When I look for this large omap object, this is the one:
for i in $(ceph pg ls-by-pool default.rgw.log | tail -n +2 | awk '{print $1}'); do
    echo -n "$i: "
    ceph pg $i query | grep num_large_omap_objects | head -1 | awk '{print $2}'
done | grep ": 1"
4.d: 1
The only way I have found to reduce its size is:
radosgw-admin usage trim --end-date=2019-05-01 --yes-i-really-mean-it
and I have run that many times, month by month.
However, while it is running, RGW becomes completely unreachable, the load balancer starts flapping, and users start complaining because they can't do anything.
Is there any other way to fix this, or any suggestion as to why it happens?
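For example, would trimming in smaller windows (dates are placeholders)
keep each run short enough for RGW to stay responsive:
    radosgw-admin usage trim --start-date=2019-04-01 --end-date=2019-05-01 --yes-i-really-mean-it
Or, if we don't actually need the usage log, could we simply turn it
off (rgw_enable_usage_log = false in ceph.conf)? I'm not sure of the
side effects of either.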
Thank you.