Hi,
I want to create an optimizer plan for each pool.
My cluster has multiple CRUSH roots and multiple pools, each
representing a specific drive type (HDD, SSD, NVMe).
Some pools are balanced, some are not.
Therefore I want to run the optimizer to create a new plan for a specific pool.
However, this fails for every pool with the same error message:
root@ld3955:~# ceph balancer optimize hdd-plan hdd
Error EALREADY: Unable to find further optimization, or pool(s)' pg_num
is decreasing, or distribution is already perfect
root@ld3955:~# ceph balancer optimize ssd-plan ssd
Error EALREADY: Unable to find further optimization, or pool(s)' pg_num
is decreasing, or distribution is already perfect
root@ld3955:~# ceph balancer optimize hdb_backup-plan hdb_backup
Error EALREADY: Unable to find further optimization, or pool(s)' pg_num
is decreasing, or distribution is already perfect
root@ld3955:~# ceph osd pool ls
hdb_backup
hdd
ssd
nvme
cephfs_data
cephfs_metadata
What is causing this error?
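For reference, I assume the next step is to check the score the balancer computes per pool, e.g.:
root@ld3955:~# ceph balancer eval hdd
root@ld3955:~# ceph balancer eval ssd
and, if the distribution is already considered "good enough" at the default deviation, to lower the threshold so the optimizer tries harder; I believe the knob in Nautilus is:
root@ld3955:~# ceph config set mgr mgr/balancer/upmap_max_deviation 1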
THX
Hi all,
The only recommendation I can find in the documentation about DB device selection concerns capacity (4% of the data disk). Are there any suggestions about technical specs such as throughput, IOPS, and the number of data disks per DB device?
When designing an infrastructure with filestore, we chose the journal device's specs to meet the requirements of all the disks behind the journal. With bluestore, however, data is written directly to the data device, while metadata is written to the RocksDB (DB) device via BlueFS.
I know that it depends on the workload, but is there any best practice or recommendation for selecting the DB device?
IMO, reusing the NVMe disks that served as filestore journals as DB devices is not meaningful: NVMe disks have minimal latency and extraordinary throughput and IOPS, and I am not sure the DB device needs that kind of performance. So I want to use those NVMe disks for an all-flash pool and choose other disks for the DB device.
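For context, the layout I am describing would be provisioned with something like this (ceph-volume syntax; the device paths are examples):
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
i.e. the question is what specs the device behind --block.db really needs.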
Any suggestion or recommendation would be appreciated.
Best regards,
Huseyin Cotuk
hcotuk(a)gmail.com
Hi,
I activated the balancer in order to balance the data distribution:
root@ld3955:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
However, the data stored on the 1.6TB HDDs in the pool "hdb_backup" is
not balanced; the range starts with
osd.265 size: 1.6 usage: 52.83 reweight: 1.00000
and ends with
osd.145 size: 1.6 usage: 80.19 reweight: 1.00000
The affected drives are located on 4 nodes.
The result is that not all of the available disk space is usable.
I have attached a pastebin <https://pastebin.com/dNyEwNR0> with
- ceph osd df sorted by usage
- ceph osd df tree
Please advise how to start the balancer so that it corrects the data distribution.
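For reference, this is roughly what I expect a manual run to look like (the plan name is mine), in case the automatic mode is simply not kicking in:
root@ld3955:~# ceph balancer eval hdb_backup
root@ld3955:~# ceph balancer optimize myplan hdb_backup
root@ld3955:~# ceph balancer show myplan
root@ld3955:~# ceph balancer execute myplan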
THX
I'm in the process of testing the iSCSI target feature of Ceph. The cluster
is running ceph 14.2.4 and ceph-iscsi 3.3. It consists of 5 hosts with 12
SSD OSDs per host. Basic testing, moving VMs onto a Ceph-backed datastore,
shows only 60MB/s transfers. However, moving them back off the
datastore is fast at 200-300MB/s.
What should I be looking at to track down the write performance issue? For
comparison, the Nimble Storage arrays give me 200-300MB/s in both
directions.
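To rule the gateways in or out, I suppose I could benchmark the backing pool directly; a sketch (the pool name is a placeholder, and bench writes test objects into the pool):
rados bench -p <pool> 30 write --no-cleanup
rados bench -p <pool> 30 seq
rados -p <pool> cleanup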
Thanks,
Ryan
We had an OSD host with 13 OSDs fail today, and we have a weird blocked-op
message that I can't understand. There are no OSDs with blocked
ops, just `mon` (multiple times) and some of the rgw instances.
cluster:
id: 570bcdbb-9fdf-406f-9079-b0181025f8d0
health: HEALTH_WARN
1 large omap objects
Degraded data redundancy: 2083023/195702437 objects
degraded (1.064%), 880 pgs degraded, 880 pgs undersized
1609 pgs not deep-scrubbed in time
4 slow ops, oldest one blocked for 506699 sec, daemons
[mon,sun-gcs02-rgw01,mon,sun-gcs02-rgw02,mon,sun-gcs02-rgw03] have
slow ops.
services:
mon: 3 daemons, quorum
sun-gcs02-rgw01,sun-gcs02-rgw02,sun-gcs02-rgw03 (age 6m)
mgr: sun-gcs02-rgw02(active, since 5d), standbys: sun-gcs02-rgw03,
sun-gcs02-rgw04
osd: 767 osds: 754 up (since 10m), 754 in (since 104m); 880 remapped pgs
rgw: 16 daemons active (sun-gcs02-rgw01.rgw0, sun-gcs02-rgw01.rgw1,
sun-gcs02-rgw01.rgw2, sun-gcs02-rgw01.rgw3, sun-gcs02-rgw02.rgw0,
sun-gcs02-rgw02.rgw1, sun-gcs02-rgw02.rgw2, sun-gcs02-rgw02.rgw3,
sun-gcs02-rgw03.rgw0, sun-gcs02-rgw03.rgw1, sun-gcs02-rgw03.rgw2,
sun-gcs02-rgw03.rgw3, sun-gcs02-rgw04.rgw0, sun-gcs02-rgw04.rgw1,
sun-gcs02-rgw04.rgw2, sun-gcs02-rgw04.rgw3)
data:
pools: 7 pools, 8240 pgs
objects: 19.57M objects, 52 TiB
usage: 88 TiB used, 6.1 PiB / 6.2 PiB avail
pgs: 2083023/195702437 objects degraded (1.064%)
43492/195702437 objects misplaced (0.022%)
7360 active+clean
868 active+undersized+degraded+remapped+backfill_wait
12 active+undersized+degraded+remapped+backfilling
io:
client: 150 MiB/s rd, 642 op/s rd, 0 op/s wr
recovery: 626 MiB/s, 223 objects/s
$ ceph versions
{
"mon": {
"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 3
},
"osd": {
"ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable)": 754
},
"mds": {},
"rgw": {
"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 16
},
"overall": {
"ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be)
nautilus (stable)": 754,
"ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
nautilus (stable)": 22
}
}
I restarted one of the monitors and it dropped out of the list, leaving
only 2 blocked ops showing, but then it showed up again a little while later.
Any ideas on where to look?
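The only other thing I know to try is dumping the in-flight ops on the daemons named in the warning via the admin socket, e.g. on the first mon host:
ceph daemon mon.sun-gcs02-rgw01 ops
but I'm not sure what to make of the output for mon-side slow ops.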
Thanks,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Hi,
after enabling the Ceph balancer (with the command "ceph balancer on"), the
health status changed to error.
This is the current output of ceph health detail:
root@ld3955:~# ceph health detail
HEALTH_ERR 1438 slow requests are blocked > 32 sec; 861 stuck requests
are blocked > 4096 sec; mon ld5505 is low on available space
REQUEST_SLOW 1438 slow requests are blocked > 32 sec
683 ops are blocked > 2097.15 sec
436 ops are blocked > 1048.58 sec
191 ops are blocked > 524.288 sec
78 ops are blocked > 262.144 sec
35 ops are blocked > 131.072 sec
11 ops are blocked > 65.536 sec
4 ops are blocked > 32.768 sec
osd.62 has blocked requests > 65.536 sec
osds 39,72 have blocked requests > 262.144 sec
osds 6,19,67,173,174,187,188,269,434 have blocked requests > 524.288 sec
osds
8,16,35,36,37,61,63,64,68,73,75,178,186,271,369,420,429,431,433,436 have
blocked requests > 1048.58 sec
osds 3,5,7,24,34,38,40,41,59,66,69,74,180,270,370,421,432,435 have
blocked requests > 2097.15 sec
REQUEST_STUCK 861 stuck requests are blocked > 4096 sec
25 ops are blocked > 8388.61 sec
836 ops are blocked > 4194.3 sec
osds 2,28,29,32,60,65,181,185,268,368,423,424,426 have stuck
requests > 4194.3 sec
osds 0,30,70,71,184 have stuck requests > 8388.61 sec
I understand that when the balancer starts shifting PGs to other OSDs,
this causes IO load on the cluster.
However, I don't understand why this affects the OSDs so heavily.
And I don't understand why OSDs of a specific type (SSD, NVMe) suffer
although no balancing is occurring on them.
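For the record, these are the knobs I would try in order to soften the impact while the balancer runs (runtime settings; I believe the last option is named target_max_misplaced_ratio in Nautilus, while earlier releases used mgr/balancer/max_misplaced):
root@ld3955:~# ceph config set osd osd_max_backfills 1
root@ld3955:~# ceph config set osd osd_recovery_max_active 1
root@ld3955:~# ceph config set mgr target_max_misplaced_ratio 0.01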
Regards
Thomas
Hello everybody!
What does this mean?
health: HEALTH_WARN
1 subtrees have overcommitted pool target_size_bytes
1 subtrees have overcommitted pool target_size_ratio
and what does it have to do with the autoscaler?
When I deactivate the autoscaler the warning goes away.
$ ceph osd pool autoscale-status
POOL             SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
cephfs_metadata  15106M               3.0   2454G         0.0180  0.3000        4.0      256              on
cephfs_data      113.6T               1.5   165.4T        1.0306  0.9000        1.0      512              on
$ ceph health detail
HEALTH_WARN 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio
POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
Pools ['cephfs_data'] overcommit available storage by 1.031x due to target_size_bytes 0 on pools []
POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_ratio
Pools ['cephfs_data'] overcommit available storage by 1.031x due to target_size_ratio 0.900 on pools ['cephfs_data']
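If I read this correctly, the expected sizes add up to more than the raw capacity of the subtree (cephfs_data alone reports RATIO 1.0306, i.e. data times replication already exceeds it), so either the targets need lowering or the pool is genuinely outgrowing its capacity. Lowering the targets would at least clear the warning; the values below are examples:
$ ceph osd pool set cephfs_data target_size_ratio 0.8
$ ceph osd pool set cephfs_data target_size_bytes 0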
Thanks
Lars
Hi,
Is anyone using librados AIO APIs? I seem to have a problem with that where
the rados_aio_wait_for_complete() call just waits for a long period of time
before it finishes without error.
More info on my setup:
I am using Ceph 14.2.4 and write 8MB objects.
I run my AIO program on 24 nodes at the same time, each node writing
different data (split into 8MB objects), about 2GB per node.
Normally it takes about 10 minutes for all of them to complete, but often
one or more nodes take considerably longer to finish. When looking at one
of those, I mostly see that the IO requests have been submitted and the
program waits at:
#0 pthread_cond_wait@@GLIBC_2.3.2 () at
../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00002aaaaad0c8fa in rados_aio_wait_for_complete () from
/cgv/geovation/2/test/ceph/lib/librados.so.2
Then it eventually completes with no errors from
rados_aio_wait_for_complete() call.
The (pseudo) code looks like:
while (data remains to be written) {
    size_t aio_ops_count = 0;
    rados_completion_t aio_comp[12];
    // submit a batch of up to 12 writes without waiting
    for (size_t j = 0; j < 12; ++j) {
        int err = rados_aio_create_completion(NULL, NULL, NULL, &aio_comp[j]);
        if (err < 0) {
            cerr << "rados_aio_create_completion: " << strerror(-err) << endl;
            return 1;
        }
        string obj_ = getobjectid();
        err = rados_aio_write_full(io, obj_.c_str(), aio_comp[j], read_buf[j], bytes);
        if (err < 0) {
            cerr << "rados_aio_write_full: " << strerror(-err) << endl;
            return 1;
        }
        ++aio_ops_count;
    }
    // reap the batch in submission order
    for (size_t j = 0; j < aio_ops_count; ++j) {
        rados_aio_wait_for_complete(aio_comp[j]); // considerably longer delay here at times
        int err = rados_aio_get_return_value(aio_comp[j]);
        if (err < 0) {
            cerr << "rados_aio_get_return_value: " << strerror(-err) << endl;
            return 1;
        }
        rados_aio_release(aio_comp[j]);
    }
}
I ran it under Valgrind and saw no issues, and I also read the data back and
checksummed it to verify there is no corruption. So everything appears to
"work" as expected, except for the occasional long delays.
I'm wondering if anyone else is using the AIO APIs to write objects and has
experienced similar problems.
Please let me know if you need further information.
(Originally posted this to dev(a)ceph.io and on Daniel's suggestion, I am
posting here).
Regards,
Ponnuvel P
hi ceph-users,
I have a cluster running Ceph object storage, version 14.2.1. I want to
create 2 bucket-data pools, for security purposes:
+ one bucket-data pool for public client access from the internet (name
*zone1.rgw.buckets.data-pub*)
+ one bucket-data pool for private client access from the local network (name
*zone1.rgw.buckets.data-priv*)
Each bucket-data pool should have its own access key: a public access key
(for the public pool) and a private access key (for the private pool).
Can you give me a recommendation or a best practice that you've used? What
needs to be done?
Or give me your best solution for securing a Ceph object cluster with
public client access and private client access?
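For context, the closest mechanism I have found is RGW placement targets, which map buckets to different data pools, roughly like this (the placement-id and pool names are mine; flags may vary by release):
radosgw-admin zonegroup placement add --rgw-zonegroup default --placement-id private-placement
radosgw-admin zone placement add --rgw-zone zone1 --placement-id private-placement --data-pool zone1.rgw.buckets.data-priv --index-pool zone1.rgw.buckets.index
then one RGW user per pool, with each user's default_placement pointing at the matching target. Is that the right direction?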
Thank you very much
Br,
----------------------------------------------
Dương Tuấn Dũng
Email: dungdt.aicgroup(a)gmail.com
Tel: 0986153686