I'm getting similar errors after rebooting a node. The cluster was
upgraded from 15.2.1 to 15.2.2 yesterday, with no problems after the
reboots done during the upgrade.
On the node I just rebooted, 2/4 OSDs won't restart. Similar logs from
both. Logs from one below.
Neither OSD has compression enabled, although there is a
compression-related error in the log.
Both back replicated x3 pools. One has its data on an HDD with a
separate WAL/DB on an NVMe partition; the other has everything on a
single NVMe partition.
Feeling kinda nervous here - advice welcomed!
Thx, Chris
2020-05-20T13:14:00.837+0100 7f2e0d273700 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 3423870535 in db/000304.sst offset
18446744073709551615 size 18446744073709551615
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
[db/version_set.cc:3757] Recovered from manifest file:db/MANIFEST-000312
succeeded,manifest_file_number is 312, next_file_number is 314,
last_sequence is 22320582, log_number is 309,prev_log_number is
0,max_column_family is 0,min_log_number_to_keep is 0
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
[db/version_set.cc:3766] Column family [default] (ID 0), log number is 309
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb: EVENT_LOG_v1
{"time_micros": 1589976840843199, "job": 1, "event": "recovery_started",
"log_files": [313]}
2020-05-20T13:14:00.841+0100 7f2e1957ee00 4 rocksdb:
[db/db_impl_open.cc:583] Recovering log #313 mode 0
2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
[db/db_impl_open.cc:518] db.wal/000313.log: dropping 9044 bytes;
Corruption: error in middle of record
2020-05-20T13:14:00.937+0100 7f2e1957ee00 3 rocksdb:
[db/db_impl_open.cc:518] db.wal/000313.log: dropping 86 bytes;
Corruption: missing start of fragmented record(2)
2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
[db/db_impl.cc:390] Shutdown: canceling all background work
2020-05-20T13:14:00.937+0100 7f2e1957ee00 4 rocksdb:
[db/db_impl.cc:563] Shutdown complete
2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1 rocksdb: Corruption: error
in middle of record
2020-05-20T13:14:00.937+0100 7f2e1957ee00 -1
bluestore(/var/lib/ceph/osd/ceph-9) _open_db erroring opening db:
2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bluefs umount
2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 fbmap_alloc 0x55daf2b3a900
shutdown
2020-05-20T13:14:00.937+0100 7f2e1957ee00 1 bdev(0x55daf3838700
/var/lib/ceph/osd/ceph-9/block) close
2020-05-20T13:14:01.093+0100 7f2e1957ee00 1 bdev(0x55daf3838000
/var/lib/ceph/osd/ceph-9/block) close
2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 osd.9 0 OSD:init: unable to
mount object store
2020-05-20T13:14:01.341+0100 7f2e1957ee00 -1 ** ERROR: osd
init failed: (5) Input/output error
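For what it's worth, the next steps I'm considering (just a sketch based
on my reading of the ceph-bluestore-tool docs - happy to be told these
are the wrong tools here) are an offline consistency check with the OSD
stopped, and a repair if fsck reports fixable errors:
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-9
    ceph-bluestore-tool fsck --deep --path /var/lib/ceph/osd/ceph-9
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-9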
Thanks, Anthony.
On a 5-node cluster, replica 5, 5 MONs, failure domain host:
> > any two nodes might fail without leaving the cluster
> > unresponsive?
>
> No. Assuming again that your failure domain is “host”, some PGs will have two copies on these two nodes, so they will be undersized until redundancy is restored.
>
Ouch, right.
But aren't PGs checksummed, so that from the remaining copies (given
the checksums are correct) two new copies could be created?
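(To make sure we're talking about the same thing, this is how I'm
checking what the pool is set to tolerate - the pool name 'rbd' here is
just an example from my side:
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
My understanding is that PGs keep serving IO as long as at least
min_size copies are up, so the interplay of size 5 and min_size is what
I'm really asking about.)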
Hi all,
stumbling over a new Ceph cluster setup, I have a basic question
regarding the behaviour of Ceph.
The cluster I found runs 4 hardware nodes as hyperconverged instances:
3 nodes run a MON plus several OSD instances, while one node runs only
several OSDs.
At the same time, all nodes serve as KVM hosts, running several VMs.
Replication is set to three.
1.) Am I right that in this setup only one node may fail without
leaving the cluster unresponsive (i.e. rejecting further IO)?
If two nodes fail, the cluster will stop serving IO (but probably
won't lose data if at least one of the two failed nodes can recover).
2.) Am I further right that this setup with four nodes might provide
slightly better IO to the VMs than if it ran on only three nodes
(all of which run MON and OSD and KVM), but
3.) in a setup with 5 nodes, the IO performance delivered to the VMs
would be even better, and if all 5 nodes ran MON, OSDs and KVM,
any two nodes might fail without leaving the cluster unresponsive?
(Sure, one would have to move/restart the VMs of the failing nodes
... this is just from the underlying data-integrity point of view.)
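(In case concrete commands help to frame an answer - with 3 MONs I
assume quorum needs 2 of them, which I've been checking with
    ceph quorum_status --format json-pretty
    ceph -s
so question 1 is really about losing one vs. two of the three MON
nodes.)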
Sorry if this sounds dumb - it seems sort of obvious to me, but I have
to nail this down crystal clear.
A short, comprehensive answer would be most appreciated!
Best regards!
I've just set up a Ceph cluster and I'm accessing it via object gateway with S3 API.
One thing I don't see documented anywhere is - how does Ceph performance scale with S3 key prefixes?
In AWS S3, performance scales with the number of key prefixes (see: https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html). I see the keys as a nested hash table or as nodes of a prefix tree, where each prefix is stored in closer proximity at a hardware level - you want to spread reads evenly over prefixes to avoid concentrating parallel I/O on the same hot spots.
So for example if my access pattern regularly involves scanning data through multiple dates for a single city, this key structure will be more effective: `yyyymmdd/city/data.csv`. Whereas if my access pattern involves scanning through different cities on a single date, `city/yyyymmdd/data.csv` would be more effective.
How about Ceph? Does naming convention of the key prefixes have an effect on Ceph's object gateway performance or does it treat the full object "paths" as a completely flat namespace?
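(To make the question concrete: my understanding is that RADOS places an object by hashing its full name onto a PG, which would make prefixes irrelevant for data placement. I've been sanity-checking that with made-up keys against what I assume is the data pool, e.g.:
    ceph osd map default.rgw.buckets.data 20200501/london/data.csv
    ceph osd map default.rgw.buckets.data 20200502/london/data.csv
expecting two neighbouring keys to land on unrelated PGs.)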
Hi,
I'm planning to install a new Ceph cluster (Nautilus) using 8+3 EC on
SATA-only storage. We want to store only big files here (from 40-50MB
up to 200-300GB each). The write load will be higher than the read load.
I was thinking of the following Bluestore config, to reduce the load
on the drives as much as possible while keeping good storage
efficiency:
- stripe_unit for the Jerasure EC profile = 256K
- bluestore_min_alloc_size_hdd = bluefs_alloc_size = 1M
- RADOS object size = 4M
- rgw_max_chunk_size = rgw_obj_stripe_size = 4M
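For concreteness, I'd express that roughly like this (the profile name
is a placeholder, and I understand bluestore_min_alloc_size_hdd only
takes effect for newly created OSDs):
    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host stripe_unit=256K
    ceph config set osd bluestore_min_alloc_size_hdd 1048576
    ceph config set osd bluefs_alloc_size 1048576
    ceph config set client.rgw rgw_max_chunk_size 4194304
    ceph config set client.rgw rgw_obj_stripe_size 4194304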
What do you think? Is there any gain in using a higher
rgw_max_chunk_size and/or rgw_obj_stripe_size (for example 20M instead
of 4M)? What about bluestore_min_alloc_size and bluefs_alloc_size -
should I lower them to 512K or 256K?
Thanks,
Adrian
Hi,
I have a health warning about a full pool:
health: HEALTH_WARN
1 pool(s) full
This is the pool that is complaining:
Ceph df:
NAME ID USED %USED MAX AVAIL OBJECTS
k8s 8 200GiB 0.22 90.5TiB 51860
Rados df:
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
k8s-infraops 200GiB 51860 0 155580 0 0 0 1256337 126GiB 19012858 944GiB
What needs to be cleaned up to clear the pool-full warning?
Ceph version luminous 12.2.8.
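In case it's relevant: I understand a per-pool quota can also trigger
this warning, so I'm checking it like this (pool name as in the output
above):
    ceph osd pool get-quota k8s
    ceph df detail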
Thank you
Hi Sebastian,
I did not get your reply via e-mail. I am very sorry for this. I hope you can see this message...
I've re-run the upgrade and attached the log.
Thanks,
Gencer.
Hello everyone :)
When I try to create an OSD, Proxmox UI asks for
* Data disk
* DB disk
* WAL disk
Which disk will be the limiting factor in terms of storage size for my
OSD - the data disk?
How large do I need to make the other two?
Is there a risk of them filling up before the capacity of the
OSD is reached? If so, how do I avoid this (in terms of usage pattern)?
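(For reference, my understanding is that under the hood the UI does
roughly this - device paths here are made up:
    ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2
so my sizing questions above are about the two NVMe partitions.)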
Have a nice day :)
Hi list,
Looking into the diskprediction_local module, I see that it only
predicts three coarse states - good, warning and bad:
ceph/src/pybind/mgr/diskprediction_local/predictor.py:
    if score > 10:
        return "Bad"
    if score > 4:
        return "Warning"
    return "Good"
The predicted failure date is then just derived from that state:
ceph/src/pybind/mgr/diskprediction_local/module.py
    if result.lower() == 'good':
        life_expectancy_day_min = (TIME_WEEK * 6) + TIME_DAYS
        life_expectancy_day_max = None
    elif result.lower() == 'warning':
        life_expectancy_day_min = (TIME_WEEK * 2)
        life_expectancy_day_max = (TIME_WEEK * 6)
    elif result.lower() == 'bad':
        life_expectancy_day_min = 0
        life_expectancy_day_max = (TIME_WEEK * 2) - TIME_DAYS
Is it possible to get a finer-grained prediction date?
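(For comparison, the raw inputs and the module's own prediction can be
pulled like this - the device id is a placeholder:
    ceph device ls
    ceph device get-health-metrics <devid>
    ceph device predict-life-expectancy <devid>
but as far as I can tell the last one just reflects the same coarse
buckets.)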
--
Vytenis
Hi,
I have a large omap object in one of my clusters, running Luminous 12.2.8.
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'default.rgw.log'
Search the cluster log for 'Large omap object found' for more details.
In my setup the haproxy load balancer runs on 3 OSD nodes, which are also the monitor and manager nodes.
When I look for this large omap object, this is the one:
for i in $(ceph pg ls-by-pool default.rgw.log | tail -n +2 | awk '{print $1}'); do
    echo -n "$i: "
    ceph pg $i query | grep num_large_omap_objects | head -1 | awk '{print $2}'
done | grep ": 1"
4.d: 1
The only way I have found to reduce its size is:
radosgw-admin usage trim --end-date=2019-05-01 --yes-i-really-mean-it
and I have run that many times, month by month.
However, while it is running, RGW becomes completely unreachable, the load balancer starts flapping, and users start complaining because they can't do anything.
Is there any other way to fix this, or any suggestion as to why it happens?
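For example, would trimming in smaller windows (dates are placeholders)
keep each run short enough for RGW to stay responsive:
    radosgw-admin usage trim --start-date=2019-04-01 --end-date=2019-05-01 --yes-i-really-mean-it
Or, if we don't actually need the usage log, could we simply turn it
off (rgw_enable_usage_log = false in ceph.conf)? I'm not sure of the
side effects of either.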
Thank you.