Hi,
We have a production cluster of 27 OSDs across 5 servers (all SSDs running
BlueStore), and have started to notice a possible performance issue.
In order to isolate the problem, we built a single server with a single
OSD and ran a few FIO tests. The results are puzzling, even allowing for the
fact that we were not expecting good performance from a single OSD.
In short, during a sequential write test we are seeing huge numbers of reads
hitting the actual SSD.
Key FIO parameters are:
[global]
pool=benchmarks
rbdname=disk-1
direct=1
numjobs=2
iodepth=1
blocksize=4k
group_reporting=1
[writer]
readwrite=write
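The output further down shows the job ran with ioengine=rbd; for reference, a roughly
equivalent one-shot command line would look something like the following (the clientname
and runtime values are assumptions, not taken from the original job file):

  fio --name=writer --ioengine=rbd --clientname=admin --pool=benchmarks \
      --rbdname=disk-1 --direct=1 --numjobs=2 --iodepth=1 --bs=4k \
      --rw=write --time_based --runtime=60 --group_reporting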
iostat results are:
Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme0n1    0.00  105.00 4896.00  294.00 312080.00  1696.00   120.92    17.25   3.35    3.55    0.02   0.02  12.60
That is nearly 5000 reads/second (~300 MB/s), compared with only ~300
writes/second (~1.7 MB/s), during what is purely a sequential write test. The
system is otherwise idle, with no other workload.
Running the same fio test with only 1 thread (numjobs=1) still shows a high
number of reads (110).
Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
nvme0n1    0.00 1281.00  110.00 1463.00    440.00 12624.00    16.61     0.03   0.02    0.05    0.02   0.02   3.40
Can anyone kindly offer any comments on why we are seeing this behaviour?
I can understand the occasional read here and there if RocksDB/WAL entries
need to be read from disk during the sequential write test, but this volume
of reads seems far higher than expected.
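One hedged way to narrow this down (assuming the test OSD is osd.0 and its admin
socket is reachable on that host) is to dump the BlueFS and RocksDB perf counters
before and after a fio run and compare, to see whether the reads come from RocksDB
compaction or from BlueStore itself:

  # on the OSD host; osd.0 is an assumption for the single-OSD test box
  ceph daemon osd.0 perf dump bluefs     # BlueFS read/write byte counters
  ceph daemon osd.0 perf dump rocksdb    # compaction and get/submit statistics
  ceph daemon osd.0 config show | grep bluestore_cache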
FIO results (numjobs=2):

writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
...
fio-3.7
Starting 2 processes
Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb 3 22:46:16 2020
  write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
    slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
    clat (msec): min=2, max=210, avg=58.32, stdev=70.54
     lat (msec): min=2, max=210, avg=58.35, stdev=70.54
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[   54], 60.00th=[   62],
     | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
     | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
     | 99.99th=[  211]
   bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22, samples=239
   iops        : min=    6, max=   36, avg=16.97, stdev= 9.55, samples=239
  lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
  cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
  IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
Hello,
I'm a beginner with Ceph. I set up three Ceph clusters on Google Cloud.
Cluster1 has three nodes with three disks each, cluster2 has three nodes with
two disks each, and cluster3 has five nodes with five disks each.
All disks are HDD. Disk speed shown by `dd if=/dev/zero of=here bs=1G
count=1 oflag=direct` is 117MB/s.
The network is 10Gbps.
Ceph version is 12.2.12.
I found something strange:
1. When running `rados bench`, the write performance of all clusters drops
dramatically after a few minutes. I created a pool named "scbench" with
replicated size 1 (I know it is not safe, but I want the highest write
speed). The write performance (shown by `rados bench -p scbench 1000 write`)
before and after the drop is:

cluster1: 297MB/s -> 94.5MB/s
cluster2: 304MB/s -> 67.4MB/s
cluster3: 494MB/s -> 267.6MB/s

It looks like the performance before the drop is roughly nodes_num * 100MB/s,
and the performance after the drop is about osds_num * 10MB/s. I have no idea
why there is such a drop, or why the performance before the drop scales
linearly with nodes_num (see the sketch below for how I would try to narrow
this down).
2. The write performance of object storage (shown by `swift-bench -c 64 -s
4096000 -n 100000 -g 0 swift.conf`) is much lower than that of the storage
cluster (shown by `rados bench -p scbench 1000 write`). I have set the
replicated size of "default.rgw.buckets.data" and
"default.rgw.buckets.index" to 1.
The object storage speed on cluster1 is 117MB/s (before the drop) and 26MB/s
(after the drop), and on cluster3 it is 118MB/s (the drop does not happen
there).
Is it normal for object storage write performance to be worse than RADOS
write performance? If not, how can I solve the problem?
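Two hedged suggestions for narrowing both questions down (not a diagnosis): watch
where the client writes actually land and whether per-OSD latency jumps at the moment
the throughput drops, and take a rados-level baseline of the RGW data pool so the RGW
overhead can be separated from the underlying pool performance:

  # in a second shell while rados bench / swift-bench is running
  ceph osd perf          # per-OSD commit/apply latency
  ceph osd pool stats    # which pool is receiving the client writes
  iostat -x 2            # raw disk utilisation on the OSD nodes

  # rados-level baseline for the pool RGW actually writes to
  rados bench -p default.rgw.buckets.data 60 write --no-cleanup

If the HDDs show high utilisation and rising latency at the moment of the drop, the
initial burst is most likely being absorbed by caches/journals and the lower number
is the sustained disk rate.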
Thanks!
The cluster was upgraded from 12.2.12 to 14.2.5. All went smoothly, except for a
BlueFS spillover warning.
We create OSDs with ceph-deploy; the command goes like this:
ceph-deploy osd create --bluestore --data /dev/sdf --block-db /dev/sdb5
--block-wal /dev/sdb6 ceph-osd3
where block-db and block-wal are SSD partitions.
The default ceph-deploy settings created ~1GB partitions, which is, of course,
too small. So we redeployed the OSDs using a manually partitioned SSD for
block-db/block-wal, with sizes of 20G/5G respectively.
But we still get BlueFS spillover warnings for the redeployed OSDs:
    osd.10 spilled over 2.4 GiB metadata from 'db' device (2.8 GiB used of 19 GiB) to slow device
    osd.19 spilled over 3.7 GiB metadata from 'db' device (2.7 GiB used of 19 GiB) to slow device
    osd.20 spilled over 4.2 GiB metadata from 'db' device (2.6 GiB used of 19 GiB) to slow device
OSD size is 1.8 TiB.
These OSDs are used primarily for RBD as backup drives, so a lot of snapshots
are held there. They also have an RGW pool assigned to them, but it has no
data.
I know of the sizing recommendations [1] for block-db/block-wal, but since it's
primarily RBD I assumed 1% (~20G) would be enough.
Also, the compaction stats don't make sense to me [2]. They state that the total
DB size is only 5.08GB, which should fit on the block-db device without a problem.
Am I understanding all this wrong? Should the block-db be larger in my case?
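Before resizing anything, it might be worth confirming where the DB data actually
sits and whether a manual compaction pulls it back. A hedged note on why 20G may
still spill: BlueFS only keeps a RocksDB level on the fast device if the whole
level fits, and with default RocksDB level sizes (256MB base, x10 multiplier) that
works out to roughly the often-quoted 3GB/30GB/300GB useful sizes, so a 20G
partition holds L0-L2 (~3GB) and anything beyond that goes to the slow device:

  # osd.10 taken as the example from the warning above
  ceph daemon osd.10 perf dump bluefs | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'
  # trigger a manual RocksDB compaction and re-check the spillover (I/O heavy but safe)
  ceph tell osd.10 compact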
[1]
https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/…
[2] osd.10 logs as an example
https://pastebin.com/hC6w6jSn
Servers: 6 (7 OSDs each), 42 OSDs in total
OS: CentOS 7
Ceph: 10.2.5
Hi, everyone
The cluster is used for VM image storage and object storage, and I have a
bucket with more than 20 million objects.
Now I have a problem: the cluster blocks operations.
The cluster suddenly blocked operations, and the VMs could no longer read
their disks.
After a few hours, osd.1 went down.
There are no disk failure messages in dmesg, and no errors in
smartctl -a /dev/sde.
I tried to bring osd.1 back up, but it goes down again soon after.
Just after restarting osd.1, the VMs can access their disks again.
But osd.1 constantly uses 100% CPU, so the cluster marks it down again and the
daemon dies on the suicide timeout.
I also found that the osdmap epoch of osd.1 differs from the other OSDs',
which is why I think osd.1 went down.
Questions:
(1) Why does the epoch of osd.1 differ from the other OSDs'?
I checked oldest_map and newest_map on all OSDs with `ceph daemon osd.X status`.
All OSDs' epochs are the same number except osd.1's (the sketch below shows how
I line them up).
(2) Why does osd.1 use 100% CPU?
Even after the cluster marked osd.1 down, it stays busy.
When I run "ceph tell osd.1 injectargs --debug-ms 5/1", osd.1 doesn't answer.
Thank you.
--
Makito
Hi,
for test purposes, I have set up two 100 GB OSDs, one holding the data pool
and the other the metadata pool for CephFS.
I am running 14.2.6-1-gffd69200ad-1 with packages from
https://mirror.croit.io/debian-nautilus
I am then running a program that creates a lot of 1 MiB files by calling
fopen()
fwrite()
fclose()
for each of them. Error codes are checked.
This works successfully for ~100 GB of data, and then, strangely, also appears
to succeed for hundreds of GB more... ??
All written files have size 1 MiB according to 'ls', and thus should contain the
data written. However, on inspection, the files written after the first ~100 GiB
are full of just zeros (hexdump -C).
To further test this, I use the standard tool 'cp' to copy a few random-content
files into the full CephFS filesystem. cp reports no complaints, and after
the copy operations the content is visible with hexdump -C. However, after forcing
the data out of the client's cache by reading other, earlier-created files,
hexdump -C shows all-zero content for the files copied with 'cp'. Data that was
there is suddenly gone...?
I am new to ceph. Is there an option I have missed to avoid this behaviour?
(I could not find one in
https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
Is this behaviour related to
https://docs.ceph.com/docs/mimic/cephfs/full/
?
(That page states 'sometime after a write call has already returned 0'. But if
write returns 0, then no data has been written, so the user program would not
assume any kind of success.)
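For what it's worth, a hedged way to check whether the cluster really is at its full
threshold (rather than this being purely a client-side caching issue) is:

  ceph df detail                        # per-pool usage vs. available space
  ceph osd dump | grep full_ratio       # full / backfillfull / nearfull thresholds
  ceph health detail                    # should flag full OSDs/pools if writes are being refused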
Best regards,
Håkan
Hi Jake and all,
We're having what looks to be the exact same problem. In our case it
happened when I was "draining" an OSD for removal (ceph crush
remove...). Adding the OSD back doesn't help work around the bug.
Everything is either triply replicated or EC k3m2, either of which
should withstand the loss of two hosts (much less a single OSD).
We're running 13.2.6.
I tried various OSD restarts and deep-scrubs, with no change. I'm
leaving things alone, hoping that croit.io will update their packages to
13.2.8 soonish. Maybe that will help kick it in the pants.
Chad.
Hi,
Our large Luminous cluster still has around 2k FileStore OSDs (35% of OSDs). We haven't had any particular need to move these over to BlueStore yet, as the performance is fine for our use case. Obviously, it would be easiest if we could let the FileStore OSDs stay in the cluster until the hardware generation is removed from the cluster in a few years.
While reading around for our upgrade to Nautilus, I stumbled across a thread [1] where Sage mentioned they had some issues getting FileStore OSDs and PG merging to play nicely. While I'm not planning on trying to merge PGs on this cluster any time soon, it raised a few more general questions about the OSD back ends.
- How much extra risk are we exposed to by running both types of OSDs in the same pools? So far we have not run into any issues because of this, but I get the feeling it is not a well-tested configuration compared to BlueStore-only clusters.
- Are FileStore OSDs going to continue to be supported indefinitely? I know they have their uses, but if there is an aspiration to reduce support, or to stop supporting them entirely, in the coming releases, then we may need to plan around that.
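As a side note, a minimal way to keep an eye on the FileStore/BlueStore mix during
the transition (these commands are available since Luminous):

  ceph osd count-metadata osd_objectstore      # how many OSDs report filestore vs. bluestore
  ceph osd metadata 0 | grep osd_objectstore   # back end of a single OSD; 0 is just an example id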
Any thoughts, opinions, or pointers to previous discussions about this would be great.
Thanks,
Tom
[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GRQOYECPFTO…
[Moving this to ceph-users@ceph.io]
This looks like https://tracker.ceph.com/issues/43365, which *looks* like
it is an issue with the standard libraries in Ubuntu 18.04. One user
said: "After upgrading our monitor Ubuntu 18.04 packages (apt-get upgrade)
with the 5.3.0-26-generic kernel, it seems that the crashes have been
fixed (they run stable now for 8 days)." Can you give that a try?
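In concrete terms, that would be something along these lines on each affected
monitor host (a sketch, assuming apt-based packaging and systemd-managed mons):

  sudo apt-get update && sudo apt-get upgrade   # pull in updated system libraries
  uname -r                                      # check the running kernel version
  sudo systemctl restart ceph-mon.target        # restart the mon (or reboot if a new kernel was installed)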
Also,
On Wed, 5 Feb 2020, Micha Ballmann wrote:
> Hi,
>
> i have a Proxmox Ceph Cluster VE 6.1-5.
>
> # ceph -v
> ceph version 14.2.6 (ba51347bdbe28c7c0e2e9172fa2983111137bb60) nautilus
> (stable)
>
> My problem: since version 14.2.6 I'm receiving the following messages nearly
> every day:
>
> # ceph -s
>
> ...
>
> health: HEALTH_WARN
> 2 daemons have recently crashed
>
> ...
>
> I archive the messages:
>
> # ceph crash archive-all
>
> But one or two days later the same problem occurs. I am trying to find out
> what the problem is:
>
> For example:
>
> # ceph crash info <ID>
Note that this ID is a hash of the stack trace and is meant to be a
unique signature for this crash/bug, but contains no identifying
information. Yours is probably one of:
6a617f9d477ab8df2d068af0768ff741c68adabcc5c1ecb5dd3e9872d613c943
dacbff55030f3d0837e58d8f4961441b6902d5750b0e1579682df5650c33d44d
Please consider turning on telemetry so we get this crash information
automatically:
https://docs.ceph.com/docs/master/mgr/telemetry/
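For reference, on Nautilus enabling it should just be (assuming the mgr telemetry
module is present):

  ceph telemetry show    # preview the report that would be sent
  ceph telemetry on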
Thanks!
sage
>
> Node4
> {
> "os_version_id": "10",
> "utsname_machine": "x86_64",
> "entity_name": "mon.promo4",
> "backtrace": [
> "(()+0x12730) [0x7f30ca142730]",
> "(gsignal()+0x10b) [0x7f30c9c257bb]",
> "(abort()+0x121) [0x7f30c9c10535]",
> "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3)
> [0x7f30cb27be79]",
> "(()+0x282000) [0x7f30cb27c000]",
> "(Paxos::store_state(MMonPaxos*)+0xaa8) [0x5602540626f8]",
> "(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2ea)
> [0x560254062a5a]",
> "(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223)
> [0x560254068213]",
> "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c)
> [0x560253f9db1c]",
> "(Monitor::_ms_dispatch(Message*)+0x4aa) [0x560253f9e10a]",
> "(Monitor::ms_dispatch(Message*)+0x26) [0x560253fcda36]",
> "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26)
> [0x560253fc9f66]",
> "(DispatchQueue::entry()+0x1a49) [0x7f30cb4b4e69]",
> "(DispatchQueue::DispatchThread::entry()+0xd) [0x7f30cb5629ed]",
> "(()+0x7fa3) [0x7f30ca137fa3]",
> "(clone()+0x3f) [0x7f30c9ce74cf]"
> ],
> "process_name": "ceph-mon",
> "assert_line": 485,
> "archived": "2020-01-21 07:02:49.036123",
> "assert_file":
> "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h",
> "utsname_sysname": "Linux",
> "os_version": "10 (buster)",
> "os_id": "10",
> "assert_msg":
> "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: In
> function 'ceph::time_detail::timespan
> ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f30c11fe700 time
> 2020-01-21
> 03:43:48.848411\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h:
> 485: FAILED ceph_assert(z >= signedspan::zero())\n",
> "assert_func": "ceph::time_detail::timespan
> ceph::to_timespan(ceph::time_detail::signedspan)",
> "ceph_version": "14.2.6",
> "os_name": "Debian GNU/Linux 10 (buster)",
> "timestamp": "2020-01-21 02:43:48.891122Z",
> "assert_thread_name": "ms_dispatch",
> "utsname_release": "5.3.13-1-pve",
> "utsname_hostname": "promo4",
> "crash_id":
> "2020-01-21_02:43:48.891122Z_0aade13c-463f-43fe-9b05-76ca71f6bc1b",
> "assert_condition": "z >= signedspan::zero()",
> "utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)"
> }
>
> Node2
> {
> "os_version_id": "10",
> "utsname_machine": "x86_64",
> "entity_name": "mon.promo2",
> "backtrace": [
> "(()+0x12730) [0x7f74f6c3f730]",
> "(gsignal()+0x10b) [0x7f74f67227bb]",
> "(abort()+0x121) [0x7f74f670d535]",
> "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3)
> [0x7f74f7d78e79]",
> "(()+0x282000) [0x7f74f7d79000]",
> "(Paxos::store_state(MMonPaxos*)+0xaa8) [0x55b9540ae6f8]",
> "(Paxos::handle_commit(boost::intrusive_ptr<MonOpRequest>)+0x2ea)
> [0x55b9540aea5a]",
> "(Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x223)
> [0x55b9540b4213]",
> "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x131c)
> [0x55b953fe9b1c]",
> "(Monitor::_ms_dispatch(Message*)+0x4aa) [0x55b953fea10a]",
> "(Monitor::ms_dispatch(Message*)+0x26) [0x55b954019a36]",
> "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26)
> [0x55b954015f66]",
> "(DispatchQueue::entry()+0x1a49) [0x7f74f7fb1e69]",
> "(DispatchQueue::DispatchThread::entry()+0xd) [0x7f74f805f9ed]",
> "(()+0x7fa3) [0x7f74f6c34fa3]",
> "(clone()+0x3f) [0x7f74f67e44cf]"
> ],
> "process_name": "ceph-mon",
> "assert_line": 485,
> "archived": "2020-01-21 07:02:49.041386",
> "assert_file":
> "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h",
> "utsname_sysname": "Linux",
> "os_version": "10 (buster)",
> "os_id": "10",
> "assert_msg":
> "/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h: In
> function 'ceph::time_detail::timespan
> ceph::to_timespan(ceph::time_detail::signedspan)' thread 7f74edcfb700 time
> 2020-01-20
> 22:32:56.933800\n/mnt/npool/tlamprecht/pve-ceph/ceph-14.2.6/src/common/ceph_time.h:
> 485: FAILED ceph_assert(z >= signedspan::zero())\n",
> "assert_func": "ceph::time_detail::timespan
> ceph::to_timespan(ceph::time_detail::signedspan)",
> "ceph_version": "14.2.6",
> "os_name": "Debian GNU/Linux 10 (buster)",
> "timestamp": "2020-01-20 21:32:56.947402Z",
> "assert_thread_name": "ms_dispatch",
> "utsname_release": "5.3.13-1-pve",
> "utsname_hostname": "promo2",
> "crash_id":
> "2020-01-20_21:32:56.947402Z_3ae7220c-23c9-478a-a22d-626c2fa34414",
> "assert_condition": "z >= signedspan::zero()",
> "utsname_version": "#1 SMP PVE 5.3.13-1 (Thu, 05 Dec 2019 07:18:14 +0100)"
> }
>
> Is there a problem with my NTP? I'm syncing my time with chrony against my
> local NTP server.
>
> It would be nice if you can help.
>
> I have to say my Ceph cluster is clean and works without any issue. All OSDs
> are up, and after ceph crash archive-all, ceph -s says HEALTH_OK.
>
> Regards
>
> Micha
>
>