Hi Frank,

On 7/31/2020 10:31 AM, Frank Schilder wrote:
> Hi Igor,
>
> thanks. I guess the problem with finding the corresponding images is that it happens at
> the bluestore level and not at the object level. Even if I listed all rados objects and
> added up their sizes, I would not see the excess storage.
>
> Thinking about working around this issue, would re-writing the objects deflate the excess
> usage? For example, evacuating an OSD and adding it back to the pool after it was empty,
> would this re-write the objects on this OSD without the overhead?

Maybe, but I can't say for sure.

> Or simply copying an entire RBD image, would the copy be deflated?
>
> Although the latter options sound a bit crazy, one could do this without (much) downtime
> of VMs and it might get us through this migration.

You might also want to try PG export/import using ceph-objectstore-tool. See
https://ceph.io/geen-categorie/incomplete-pgs-oh-my/ for some hints on how to do that.

But again, I'm not certain it will help. It would be preferable to try this on a
non-production cluster first.
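
For orientation, a PG export/import with ceph-objectstore-tool would look roughly like the sketch below. It only prints the commands so they can be reviewed first. The helper name, the PG id, the OSD ids and the /var/lib/ceph/osd/ceph-<id> data path are assumptions for illustration, and both OSDs would have to be stopped while the tool runs. Whether the imported copy ends up with less allocation overhead is untested.

```shell
# pg_move_cmds PGID SRC_OSD DST_OSD FILE
# Print (do not run) the export and import commands for one PG.
# The data path layout below is the usual default and an assumption here.
pg_move_cmds() {
    pgid=$1
    src=$2
    dst=$3
    file=$4
    echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$src --pgid $pgid --op export --file $file"
    echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$dst --op import --file $file"
}

# Hypothetical PG id and OSD ids, for illustration only.
pg_move_cmds 11.0s0 181 182 /tmp/pg-11.0s0.export
```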
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <ifedotov(a)suse.de>
> Sent: 30 July 2020 15:40
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] mimic: much more raw used than reported
>
> Hi Frank,
>
> On 7/30/2020 11:19 AM, Frank Schilder wrote:
>> Hi Igor,
>>
>> thanks for looking at this. Here a few thoughts:
>>
>> The copy goes to NTFS. I would expect between 2-4 metadata operations per write,
>> which would go to a few existing objects. I guess the difference
>> bluestore_write_small - bluestore_write_small_new consists mostly of such writes, which are
>> susceptible to the partial overwrite amplification. A first question is, how many objects are
>> actually affected? 30000 small writes does not mean 30000 objects have partial overwrites.
>>
>> The large number of small_new is indeed strange, although these would not lead to
>> excess allocations. It is possible that the write size of the copy tool is not ideal; I was
>> wondering about this too. I will investigate.
> small_new might relate to small tailing chunks that presumably appear
> when doing unaligned appends. Each such append triggers small_new write...
>
>
>> To know more, I would need to find out which images these small writes come from;
>> we have more than one active. Is there a low-level way to find out which objects are
>> affected by partial overwrites and which image they belong to? In your post you were
>> describing some properties like being shared/cloned etc. Can one search for such objects?
> IMO raising debug bluestore to 10 (or even 20) and subsequent OSD log
> inspection is likely to be the only means to learn which objects the OSD is
> processing... Be careful - this produces a significant amount of data and
> negatively impacts performance.
>> On a more fundamental level, I'm wondering why RBD images issue sub-object
>> size writes at all. I naively assumed that every I/O operation to RBD always implies full
>> object writes, even when just changing a single byte (thinking of an object as the equivalent
>> of a sector on a disk, the smallest atomic unit). If this is not the case, what is the
>> meaning of object size then? How does it influence I/O patterns? My benchmarks show
>> that object size matters a lot, but it becomes a bit unclear now why.
> Not sure I can provide good enough answer on the above. But I doubt that
> RBD unconditionally operates on full objects.
>
>
>> Thanks and best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <ifedotov(a)suse.de>
>> Sent: 29 July 2020 16:25:36
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] mimic: much more raw used than reported
>>
>> Frank,
>>
>> so you indeed have a pretty high number of small writes. More than half
>> of the written volume (in bytes) is done via small writes.
>>
>> And there are 6x more small requests than big ones.
>>
>>
>> This looks pretty odd for a sequential write pattern and is likely to be
>> the root cause for that space overhead.
>>
>> I can see approx 1.4GB additionally lost per each of these 3 OSDs since
>> perf dump reset ( = allocated_new - stored_new - (allocated_old -
>> stored_old))
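
The expression in parentheses can be evaluated as follows once the four counter values are at hand. This is a minimal sketch: the function name is hypothetical and the sample numbers are invented for illustration. Real values would come from bluestore_allocated and bluestore_stored in the dumps taken before and after the reset.

```shell
# lost_since_reset ALLOC_OLD STORED_OLD ALLOC_NEW STORED_NEW
# Growth of (allocated - stored) between two perf dumps, in bytes.
lost_since_reset() {
    echo $(( ( $3 - $4 ) - ( $1 - $2 ) ))
}

# Invented sample values in bytes, not taken from the dumps in this thread.
lost_since_reset 6000000000 4000000000 9000000000 5600000000
```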
>>
>> Below are some speculations on what might be happening, but for sure I
>> could be wrong/missing something. So please do not consider this as a
>> 100% valid analysis.
>>
>> The client writes in 1MB chunks. Each is split into 6 EC chunks (+2
>> added), which results in an approx. 170K write to the object store ( =
>> 1MB / 6). This corresponds to 1x128K big write and 1x42K small tailing
>> one, resulting in 3x64K allocations.
>>
>> The next adjacent client write results in another 128K blob, one more
>> "small" tailing blob and a heading blob which partially overlaps with the
>> previous tailing 42K chunk. Overlapped chunks are expected to be merged.
>> But presumably this doesn't happen due to that "partial EC overwrites"
>> issue. So instead an additional 64K blob is allocated for the overlapped range.
>>
>> I.e. 2x170K writes cause 2x128K blobs, 1x64K tailing blob and 2x64K
>> blobs for the range where the two writes adjoin. 64K wasted!
>>
>> And similarly +64K space overhead per each additional append to this object.
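
As a rough illustration of the estimate above, the following sketch redoes the arithmetic in shell. It assumes the 64K allocation unit mentioned in this thread and a simple even split over the 6 data chunks. The function names are made up for this example, and wasted_after_appends only encodes the "+64K per additional append" guess, not measured behaviour.

```shell
AU=65536  # allocation unit in bytes (64K, as discussed in this thread)

# chunk_bytes WRITE_BYTES K: per-OSD share of a client write, rounded up.
chunk_bytes() {
    echo $(( ( $1 + $2 - 1 ) / $2 ))
}

# round_up_au BYTES: space needed when allocating in whole units of AU.
round_up_au() {
    echo $(( ( $1 + AU - 1 ) / AU * AU ))
}

# wasted_after_appends N: the estimate of one extra AU for every
# append after the first one (an assumption, see the text above).
wasted_after_appends() {
    echo $(( ( $1 - 1 ) * AU ))
}

chunk=$(chunk_bytes 1048576 6)
alloc=$(round_up_au "$chunk")
echo "chunk=$chunk allocated=$alloc padding=$(( alloc - chunk ))"
echo "estimated waste after 4 adjacent writes: $(wasted_after_appends 4)"
```

Under these assumptions one 1MB write maps to a chunk of about 170K that occupies 3x64K, and four adjacent 1MB writes would waste 3x64K per shard.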
>>
>>
>> Again I'm not completely sure the above analysis is 100% valid and this
>> doesn't explain that large amount of small requests. But you might want
>> to check/tune/experiment on client writing size. E.g. increase it to 4M
>> if it's less or make it divisible by 6.
>>
>> Hope this helps.
>>
>> Thanks,
>>
>> Igor
>>
>> On 7/29/2020 4:06 PM, Frank Schilder wrote:
>>
>>> Hi Igor,
>>>
>>> thanks! Here is a sample extract for one OSD, with the time stamp (+%F-%H%M%S) in the file
>>> name. For the second collection I let it run for about 10 minutes after reset:
>>>
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_big": 10216689,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_big_bytes": 992602882048,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_big_blobs": 10758603,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small": 63863813,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_bytes": 1481631167388,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_unused": 17279108,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_deferred": 13629951,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_pre_read": 13629951,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_write_small_new": 32954754,
>>> perf_dump_2020-07-29-142739.osd181: "compress_success_count": 1167212,
>>> perf_dump_2020-07-29-142739.osd181: "compress_rejected_count": 1493508,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_compressed": 149993487447,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_compressed_allocated": 206610432000,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_compressed_original": 362672914432,
>>> perf_dump_2020-07-29-142739.osd181: "bluestore_extent_compress": 24431903,
>>>
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_big": 10736,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_big_bytes": 1363214336,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_big_blobs": 12291,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_small": 67527,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_bytes": 1591140352,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_unused": 17528,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_deferred": 13854,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_pre_read": 13854,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_write_small_new": 36145,
>>> perf_dump_2020-07-29-143836.osd181: "compress_success_count": 1641,
>>> perf_dump_2020-07-29-143836.osd181: "compress_rejected_count": 2341,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_compressed": 150044304023,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_compressed_allocated": 206654210048,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_compressed_original": 362729676800,
>>> perf_dump_2020-07-29-143836.osd181: "bluestore_extent_compress": 24979,
>>>
>>> If necessary, the full outputs for 3 OSDs can be found here:
>>>
>>> Before reset:
>>>
>>> https://pastebin.com/zNgRwuNv
>>> https://pastebin.com/NDzdbhWc
>>> https://pastebin.com/mpra6PAS
>>>
>>> After reset:
>>>
>>> https://pastebin.com/Ywrwscea
>>> https://pastebin.com/sLjxK1Jw
>>> https://pastebin.com/ik3n7Xtz
>>>
>>> I do see an unreasonable number of small (re-)writes with an average size of ca.
>>> 20K, which seems not to be due to compression. Unfortunately, I can't see anything about
>>> the alignment of writes.
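
The ratios can be recomputed from the counters quoted above. This is a small sketch using integer shell arithmetic, so results are truncated. The numbers are copied from the osd181 dump taken after the reset, and the helper names are only for this example.

```shell
# Counter values from perf_dump_2020-07-29-143836.osd181 (after reset).
big=10736
big_bytes=1363214336
small=67527
small_bytes=1591140352

# avg TOTAL COUNT: integer quotient, used for average sizes and ratios.
avg() {
    echo $(( $1 / $2 ))
}

# pct PART OTHER: share of PART in PART+OTHER, in whole percent.
pct() {
    echo $(( 100 * $1 / ( $1 + $2 ) ))
}

echo "small requests per big request: $(avg "$small" "$big")"
echo "average small write in bytes:   $(avg "$small_bytes" "$small")"
echo "average big write in bytes:     $(avg "$big_bytes" "$big")"
echo "percent of bytes in small:      $(pct "$small_bytes" "$big_bytes")"
```

The average small write comes out at roughly 23K, which is in line with the "ca. 20K" above.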
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Igor Fedotov <ifedotov(a)suse.de>
>>> Sent: 29 July 2020 14:04:34
>>> To: Frank Schilder; ceph-users
>>> Subject: Re: [ceph-users] mimic: much more raw used than reported
>>>
>>> Hi Frank,
>>>
>>> you might want to proceed with perf counters' dump analysis in the
>>> following way:
>>>
>>> For 2-3 arbitrary osds
>>>
>>> - save current perf counter dump
>>>
>>> - reset perf counters
>>>
>>> - leave OSD under the regular load for a while.
>>>
>>> - dump perf counters again
>>>
>>> - share both saved and new dumps and/or check stats on 'big' writes vs.
>>> 'small' ones.
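
The steps in the list above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the file naming follows the perf_dump_<timestamp>.osd<id> pattern used elsewhere in this thread, and the commands have to run on the host that carries the OSD because "ceph daemon" talks to the local admin socket.

```shell
# perf_snapshot OSD_ID OUT_DIR
# Save the current perf counter dump of one OSD and print the file name.
perf_snapshot() {
    osd_id=$1
    out_dir=$2
    stamp=$(date +%F-%H%M%S)
    out="$out_dir/perf_dump_${stamp}.osd${osd_id}"
    ceph daemon "osd.$osd_id" perf dump > "$out" && echo "$out"
}

# perf_reset OSD_ID: reset all perf counters of one OSD.
perf_reset() {
    ceph daemon "osd.$1" perf reset all
}

# Example session (commented out, needs a running OSD on this host):
# before=$(perf_snapshot 181 /tmp)
# perf_reset 181
# sleep 600    # leave the OSD under regular load for a while
# after=$(perf_snapshot 181 /tmp)
# grep -e bluestore_write_big -e bluestore_write_small "$before" "$after"
```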
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>> On 7/29/2020 2:49 PM, Frank Schilder wrote:
>>>
>>>> Dear Igor,
>>>>
>>>> please find below data from "ceph osd df tree" and per-OSD bluestore stats
>>>> pasted together, with the script for extraction for reference. We have now:
>>>>
>>>> df USED: 142 TB
>>>> bluestore_stored: 190.9TB (142*8/6 = 189, so matches)
>>>> bluestore_allocated: 275.2TB
>>>> osd df tree USE: 276.1 (so matches with bluestore_allocated as well)
>>>>
>>>> The situation has gotten worse, the mismatch of raw used to stored is now
>>>> 85TB. Compression is almost irrelevant. This matches with my earlier report with data
>>>> taken from "ceph osd df tree" alone. Compared with my previous report, what I
>>>> seem to see is that a sequential write of 22TB (user data) causes an excess of 16TB (raw).
>>>> This does not make sense and is not explained by the partial overwrite amplification you
>>>> referred me to.
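
The accounting behind these figures can be checked with a few lines of shell. This is a sketch that relies on awk for the floating point math. The helper names are made up for this example, and the inputs are the df USED, bluestore_allocated and bluestore_stored totals listed above, in the units used in this thread.

```shell
# expected_stored USED K M: data volume after EC, i.e. USED * (K+M)/K.
expected_stored() {
    awk -v u="$1" -v k="$2" -v m="$3" 'BEGIN { printf "%.1f\n", u * (k + m) / k }'
}

# excess ALLOCATED STORED: absolute excess and excess in percent of stored.
excess() {
    awk -v a="$1" -v s="$2" 'BEGIN { printf "%.1f %.1f\n", a - s, 100 * (a - s) / s }'
}

expected_stored 142 6 2    # df USED with EC 6+2
excess 275.2 190.9         # summed bluestore_allocated and bluestore_stored
```

With the totals above this gives about 189.3 for the expected stored volume and an excess of about 84.3, i.e. roughly 44% over stored.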
>>>>
>>>> The real question I still have is how can I find out how much of the
>>>> excess usage is attributed to the issue you pointed me to, and how much might be due to
>>>> something else. I would probably need a way to find objects that are affected by partial
>>>> overwrite amplification and account for their total to see how much of the excess they
>>>> explain, ideally allowing me to identify the RBD images responsible.
>>>>
>>>> I do *not* believe that *all* this extra usage is due to the partial
>>>> overwrite amplification. We do not have the use case simulated with the subsequent dd
>>>> commands in your post
>>>> https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/OHPO43J54TPBEUISYC…,
>>>> overwriting old data with an offset. On these images, we store very large files (15GB)
>>>> that are written *only* *once* and not modified again. We currently do nothing else but
>>>> sequential writes to a file system.
>>>>
>>>> The only objects that might see a partial overwrite could be at the tail
>>>> of such a file, when the beginning of a new file is written to an object that already
>>>> holds a tail, and potentially objects holding file system metadata. With an RBD object
>>>> size of 4M, this amounts to a comparably small number of objects that almost certainly
>>>> cannot explain the observed 44% excess even assuming worst case amplification.
>>>>
>>>> The data:
>>>>
>>>> NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
>>>> sr-rbd-data-one-hdd  11  142 TiB  71.12  58 TiB     37415413
>>>>
>>>> osd df tree blue stats
>>>> ID SIZE USE alloc store
>>>> 84 8.9 6.2 6.1 4.3
>>>> 145 8.9 5.6 5.5 3.7
>>>> 156 8.9 6.3 6.2 4.2
>>>> 168 8.9 6.1 6.0 4.1
>>>> 181 8.9 6.6 6.6 4.4
>>>> 74 8.9 5.2 5.2 3.7
>>>> 144 8.9 5.9 5.9 4.0
>>>> 157 8.9 6.6 6.5 4.5
>>>> 169 8.9 6.4 6.3 4.4
>>>> 180 8.9 6.6 6.6 4.5
>>>> 60 8.9 5.7 5.6 4.0
>>>> 146 8.9 5.9 5.8 4.0
>>>> 158 8.9 6.7 6.7 4.6
>>>> 170 8.9 6.5 6.5 4.4
>>>> 182 8.9 5.8 5.7 4.0
>>>> 63 8.9 5.8 5.8 4.1
>>>> 148 8.9 6.5 6.4 4.4
>>>> 159 8.9 4.9 4.9 3.3
>>>> 172 8.9 6.4 6.3 4.4
>>>> 183 8.9 6.5 6.4 4.4
>>>> 229 8.9 5.6 5.6 3.8
>>>> 232 8.9 6.3 6.2 4.3
>>>> 235 8.9 5.0 4.9 3.3
>>>> 238 8.9 6.6 6.5 4.4
>>>> 259 11 7.5 7.4 5.1
>>>> 231 8.9 6.2 6.1 4.2
>>>> 233 8.9 6.7 6.6 4.5
>>>> 236 8.9 6.3 6.2 4.2
>>>> 239 8.9 5.2 5.1 3.5
>>>> 263 11 6.5 6.5 4.4
>>>> 228 8.9 6.3 6.3 4.3
>>>> 230 8.9 6.0 5.9 4.0
>>>> 234 8.9 6.5 6.4 4.4
>>>> 237 8.9 6.0 5.9 4.1
>>>> 260 11 6.6 6.5 4.5
>>>> 0 8.9 6.3 6.3 4.3
>>>> 2 8.9 6.4 6.4 4.5
>>>> 72 8.9 5.4 5.4 3.7
>>>> 76 8.9 6.2 6.1 4.3
>>>> 86 8.9 5.6 5.5 3.9
>>>> 1 8.9 6.0 5.9 4.1
>>>> 3 8.9 5.7 5.7 4.0
>>>> 73 8.9 6.1 6.0 4.3
>>>> 85 8.9 6.8 6.7 4.6
>>>> 87 8.9 6.1 6.1 4.3
>>>> SUM 406.8 276.1 275.2 190.9
>>>>
>>>> The script:
>>>>
>>>> #!/bin/bash
>>>>
>>>> # Convert an integer GiB value to TiB with one (truncated) decimal digit.
>>>> format_TB() {
>>>>     tmp=$(($1/1024))
>>>>     echo "${tmp}.$(( (10*($1-tmp*1024))/1024 ))"
>>>> }
>>>>
>>>> # Print bluestore_allocated and bluestore_stored (in TiB) for each OSD id
>>>> # given as argument, followed by a line with the totals.
>>>> blue_stats() {
>>>>     al_tot=0
>>>>     st_tot=0
>>>>     printf "%12s\n" "blue stats"
>>>>     printf "%5s %5s\n" "alloc" "store"
>>>>     for o in "$@" ; do
>>>>         # Look up the host of the OSD and read its perf counters via ssh.
>>>>         host_ip="$(ceph osd find "$o" | jq -r '.ip' | cut -d ":" -f1)"
>>>>         bs_data="$(ssh "$host_ip" ceph daemon "osd.$o" perf dump | jq '.bluestore')"
>>>>         # Counters are in bytes; convert to whole GiB before summing.
>>>>         bs_alloc=$(( $(echo "$bs_data" | jq '.bluestore_allocated') /1024/1024/1024 ))
>>>>         al_tot=$(( $al_tot+$bs_alloc ))
>>>>         bs_store=$(( $(echo "$bs_data" | jq '.bluestore_stored') /1024/1024/1024 ))
>>>>         st_tot=$(( $st_tot+$bs_store ))
>>>>         printf "%5s %5s\n" "$(format_TB $bs_alloc)" "$(format_TB $bs_store)"
>>>>     done
>>>>     printf "%5s %5s\n" "$(format_TB $al_tot)" "$(format_TB $st_tot)"
>>>> }
>>>>
>>>> # Extract ID, SIZE and USE of the hdd OSDs below "datacenter ServerRoom".
>>>> # The header row has no unit columns, hence $6 there and $7 in data rows.
>>>> df_tree_data="$(ceph osd df tree | sed -e "s/ *$//g" | awk '
>>>>     BEGIN {printf("%18s\n", "osd df tree")}
>>>>     /root default/ {o=0}
>>>>     /datacenter ServerRoom/ {o=1}
>>>>     (o==1 && $2=="hdd") {s+=$5;u+=$7;printf("%4s %5s %5s\n", $1, $5, $7)}
>>>>     f==0 {printf("%4s %5s %5s\n", $1, $5, $6);f=1}
>>>>     END {printf("%4s %5.1f %5.1f\n", "SUM", s, u)}')"
>>>>
>>>> # OSD ids: skip the two header lines and the SUM line.
>>>> OSDS=( $(echo "$df_tree_data" | tail -n +3 | awk '/SUM/ {next} {print $1}') )
>>>>
>>>> bs_data="$(blue_stats "${OSDS[@]}")"
>>>>
>>>> # Print both tables side by side.
>>>> paste -d " " <(echo "$df_tree_data") <(echo "$bs_data")
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Igor Fedotov <ifedotov(a)suse.de>
>>>> Sent: 27 July 2020 13:31
>>>> To: Frank Schilder; ceph-users
>>>> Subject: Re: [ceph-users] mimic: much more raw used than reported
>>>>
>>>> Frank,
>>>>
>>>> suggest to start with perf counter analysis as per the second part of my
>>>> previous email...
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>> On 7/27/2020 2:30 PM, Frank Schilder wrote:
>>>>> Hi Igor,
>>>>>
>>>>> thanks for your answer. I was thinking about that, but as far as I
>>>>> understood, hitting this bug actually requires a partial rewrite to happen. However, these
>>>>> are disk images in storage servers with basically static files, many of which are very large
>>>>> (15GB). Therefore, I believe, the vast majority of objects is written to only once and
>>>>> should not be affected by the amplification bug.
>>>>>
>>>>> Is there any way to confirm/rule out that/check how much
>>>>> amplification is happening?
>>>>>
>>>>> I'm wondering if I might be observing something else. Since
>>>>> "ceph osd df tree" does report the actual utilization and I have only one pool
>>>>> on these OSDs, there is no problem with accounting allocated storage to a pool. I know it's
>>>>> all used by this one pool. I'm more wondering if it's not the known amplification but
>>>>> something else (at least partly) that plays a role here.
>>>>>
>>>>> Thanks and best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Igor Fedotov <ifedotov(a)suse.de>
>>>>> Sent: 27 July 2020 12:54:02
>>>>> To: Frank Schilder; ceph-users
>>>>> Subject: Re: [ceph-users] mimic: much more raw used than reported
>>>>>
>>>>> Hi Frank,
>>>>>
>>>>> you might be hitting https://tracker.ceph.com/issues/44213
>>>>>
>>>>> In short, the root causes are significant space overhead due to the high
>>>>> bluestore allocation unit (64K) and the EC overwrite design.
>>>>>
>>>>> This is fixed for the upcoming Pacific release by using a 4K alloc unit, but it
>>>>> is unlikely to be backported to earlier releases due to its complexity,
>>>>> to say nothing of the need for OSD redeployment. Hence please expect
>>>>> no fix for mimic.
>>>>>
>>>>>
>>>>> And your raw usage reports might still be not that good since mimic
>>>>> lacks per-pool stats collection https://github.com/ceph/ceph/pull/19454.
>>>>> I.e. your actual raw space usage is higher than reported. To estimate
>>>>> proper raw usage one can use bluestore perf counters (namely
>>>>> bluestore_stored and bluestore_allocated). Summing bluestore_allocated
>>>>> over all involved OSDs will give actual RAW usage. Summing
>>>>> bluestore_stored will provide actual data volume after EC processing,
>>>>> i.e. presumably it should be around 158TiB.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>> On 7/26/2020 8:43 PM, Frank Schilder wrote:
>>>>>> Dear fellow cephers,
>>>>>>
>>>>>> I observe a weird problem on our mimic-13.2.8 cluster. We have an
>>>>>> EC RBD pool backed by HDDs. These disks are not in any other pool. I noticed that the
>>>>>> total capacity (=USED+MAX AVAIL) reported by "ceph df detail" has shrunk
>>>>>> recently from 300TiB to 200TiB. Part, but by no means all, of this can be explained by
>>>>>> imbalance of the data distribution.
>>>>>>
>>>>>> When I compare the output of "ceph df detail" and
>>>>>> "ceph osd df tree", I find 69TiB raw capacity used but not accounted for; see
>>>>>> calculations below. These 69TiB raw are equivalent to 20% usable capacity and I really
>>>>>> need it back. Together with the imbalance, we lose about 30% capacity.
>>>>>>
>>>>>> What is using these extra 69TiB and how can I get it back?
>>>>>>
>>>>>>
>>>>>> Some findings:
>>>>>>
>>>>>> These are the 5 largest images in the pool, accounting for a
>>>>>> total of 97TiB out of 119TiB usage:
>>>>>>
>>>>>> # rbd du :
>>>>>> NAME PROVISIONED USED
>>>>>> one-133 25 TiB 14 TiB
>>>>>> NAME PROVISIONED USED
>>>>>> one-153@222 40 TiB 14 TiB
>>>>>> one-153@228 40 TiB 357 GiB
>>>>>> one-153@235 40 TiB 797 GiB
>>>>>> one-153@241 40 TiB 509 GiB
>>>>>> one-153@242 40 TiB 43 GiB
>>>>>> one-153@243 40 TiB 16 MiB
>>>>>> one-153@244 40 TiB 16 MiB
>>>>>> one-153@245 40 TiB 324 MiB
>>>>>> one-153@246 40 TiB 276 MiB
>>>>>> one-153@247 40 TiB 96 MiB
>>>>>> one-153@248 40 TiB 138 GiB
>>>>>> one-153@249 40 TiB 1.8 GiB
>>>>>> one-153@250 40 TiB 0 B
>>>>>> one-153 40 TiB 204 MiB
>>>>>> <TOTAL> 40 TiB 16 TiB
>>>>>> NAME PROVISIONED USED
>>>>>> one-391@3 40 TiB 432 MiB
>>>>>> one-391@9 40 TiB 26 GiB
>>>>>> one-391@15 40 TiB 90 GiB
>>>>>> one-391@16 40 TiB 0 B
>>>>>> one-391@17 40 TiB 0 B
>>>>>> one-391@18 40 TiB 0 B
>>>>>> one-391@19 40 TiB 0 B
>>>>>> one-391@20 40 TiB 3.5 TiB
>>>>>> one-391@21 40 TiB 5.4 TiB
>>>>>> one-391@22 40 TiB 5.8 TiB
>>>>>> one-391@23 40 TiB 8.4 TiB
>>>>>> one-391@24 40 TiB 1.4 TiB
>>>>>> one-391 40 TiB 2.2 TiB
>>>>>> <TOTAL> 40 TiB 27 TiB
>>>>>> NAME PROVISIONED USED
>>>>>> one-394@3 70 TiB 1.4 TiB
>>>>>> one-394@9 70 TiB 2.5 TiB
>>>>>> one-394@15 70 TiB 20 GiB
>>>>>> one-394@16 70 TiB 0 B
>>>>>> one-394@17 70 TiB 0 B
>>>>>> one-394@18 70 TiB 0 B
>>>>>> one-394@19 70 TiB 383 GiB
>>>>>> one-394@20 70 TiB 3.3 TiB
>>>>>> one-394@21 70 TiB 5.0 TiB
>>>>>> one-394@22 70 TiB 5.0 TiB
>>>>>> one-394@23 70 TiB 9.0 TiB
>>>>>> one-394@24 70 TiB 1.6 TiB
>>>>>> one-394 70 TiB 2.5 TiB
>>>>>> <TOTAL> 70 TiB 31 TiB
>>>>>> NAME PROVISIONED USED
>>>>>> one-434 25 TiB 9.1 TiB
>>>>>>
>>>>>> The large 70TiB images one-391 and one-394 are currently copied
>>>>>> to with ca. 5TiB per day.
>>>>>>
>>>>>> Output of "ceph df detail" with some columns removed:
>>>>>>
>>>>>> NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS   RAW USED
>>>>>> sr-rbd-data-one-hdd  11  119 TiB  58.45  84 TiB     31286554  158 TiB
>>>>>>
>>>>>> Pool is EC 6+2.
>>>>>> USED is correct: 31286554*4MiB=119TiB.
>>>>>> RAW USED is correct: 119*8/6=158TiB.
>>>>>> Most of this data is freshly copied onto large RBD images.
>>>>>> Compression is enabled on this pool (aggressive,snappy).
>>>>>>
>>>>>> However, when looking at "ceph osd df tree", I get the following:
>>>>>>
>>>>>> The combined raw capacity of OSDs backing this pool is 406.8TiB (sum over SIZE).
>>>>>> Summing up column USE over all OSDs gives 227.5TiB.
>>>>>>
>>>>>> This gives a difference of 69TiB (=227-158) that is not accounted for.
>>>>>>
>>>>>> Here is the output of "ceph osd df tree", limited to the drives backing the pool:
>>>>>>
>>>>>> ID CLASS WEIGHT REWEIGHT SIZE USE DATA OMAP META AVAIL %USE VAR PGS TYPE NAME
>>>>>> 84 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 5.0 TiB 180 MiB 16 GiB 3.9 TiB 56.43 1.72 103 osd.84
>>>>>> 145 hdd 8.90999 1.00000 8.9 TiB 4.6 TiB 4.6 TiB 144 MiB 14 GiB 4.3 TiB 51.37 1.57 87 osd.145
>>>>>> 156 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.1 TiB 173 MiB 16 GiB 3.8 TiB 57.91 1.77 100 osd.156
>>>>>> 168 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 5.0 TiB 164 MiB 16 GiB 3.9 TiB 56.31 1.72 98 osd.168
>>>>>> 181 hdd 8.90999 1.00000 8.9 TiB 5.5 TiB 5.4 TiB 121 MiB 17 GiB 3.5 TiB 61.26 1.87 105 osd.181
>>>>>> 74 hdd 8.90999 1.00000 8.9 TiB 4.2 TiB 4.2 TiB 148 MiB 13 GiB 4.7 TiB 46.79 1.43 85 osd.74
>>>>>> 144 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.7 TiB 106 MiB 15 GiB 4.2 TiB 53.17 1.62 94 osd.144
>>>>>> 157 hdd 8.90999 1.00000 8.9 TiB 5.8 TiB 5.8 TiB 192 MiB 18 GiB 3.1 TiB 65.02 1.99 111 osd.157
>>>>>> 169 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 172 MiB 16 GiB 3.8 TiB 56.99 1.74 102 osd.169
>>>>>> 180 hdd 8.90999 1.00000 8.9 TiB 5.8 TiB 5.8 TiB 131 MiB 18 GiB 3.1 TiB 65.04 1.99 111 osd.180
>>>>>> 60 hdd 8.90999 1.00000 8.9 TiB 4.5 TiB 4.5 TiB 155 MiB 14 GiB 4.4 TiB 50.40 1.54 93 osd.60
>>>>>> 146 hdd 8.90999 1.00000 8.9 TiB 4.8 TiB 4.8 TiB 139 MiB 15 GiB 4.1 TiB 53.70 1.64 92 osd.146
>>>>>> 158 hdd 8.90999 1.00000 8.9 TiB 5.6 TiB 5.5 TiB 183 MiB 17 GiB 3.4 TiB 62.30 1.90 109 osd.158
>>>>>> 170 hdd 8.90999 1.00000 8.9 TiB 5.7 TiB 5.6 TiB 205 MiB 18 GiB 3.2 TiB 63.53 1.94 112 osd.170
>>>>>> 182 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.6 TiB 105 MiB 14 GiB 4.3 TiB 52.27 1.60 92 osd.182
>>>>>> 63 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB 15 GiB 4.2 TiB 52.74 1.61 98 osd.63
>>>>>> 148 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.1 TiB 119 MiB 16 GiB 3.8 TiB 57.82 1.77 100 osd.148
>>>>>> 159 hdd 8.90999 1.00000 8.9 TiB 4.0 TiB 4.0 TiB 89 MiB 12 GiB 4.9 TiB 44.61 1.36 79 osd.159
>>>>>> 172 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 173 MiB 16 GiB 3.8 TiB 57.22 1.75 98 osd.172
>>>>>> 183 hdd 8.90999 1.00000 8.9 TiB 6.0 TiB 6.0 TiB 135 MiB 19 GiB 2.9 TiB 67.35 2.06 118 osd.183
>>>>>> 229 hdd 8.90999 1.00000 8.9 TiB 4.6 TiB 4.6 TiB 127 MiB 15 GiB 4.3 TiB 52.05 1.59 93 osd.229
>>>>>> 232 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.2 TiB 158 MiB 17 GiB 3.7 TiB 58.22 1.78 101 osd.232
>>>>>> 235 hdd 8.90999 1.00000 8.9 TiB 4.1 TiB 4.1 TiB 103 MiB 13 GiB 4.8 TiB 45.96 1.40 79 osd.235
>>>>>> 238 hdd 8.90999 1.00000 8.9 TiB 5.4 TiB 5.4 TiB 120 MiB 17 GiB 3.5 TiB 60.47 1.85 104 osd.238
>>>>>> 259 hdd 10.91399 1.00000 11 TiB 6.2 TiB 6.2 TiB 140 MiB 19 GiB 4.7 TiB 56.54 1.73 120 osd.259
>>>>>> 231 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB 16 GiB 3.8 TiB 56.90 1.74 101 osd.231
>>>>>> 233 hdd 8.90999 1.00000 8.9 TiB 5.5 TiB 5.5 TiB 123 MiB 17 GiB 3.4 TiB 61.78 1.89 106 osd.233
>>>>>> 236 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 114 MiB 16 GiB 3.8 TiB 57.53 1.76 101 osd.236
>>>>>> 239 hdd 8.90999 1.00000 8.9 TiB 4.2 TiB 4.2 TiB 95 MiB 13 GiB 4.7 TiB 47.41 1.45 86 osd.239
>>>>>> 263 hdd 10.91399 1.00000 11 TiB 5.3 TiB 5.3 TiB 178 MiB 17 GiB 5.6 TiB 48.73 1.49 102 osd.263
>>>>>> 228 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 113 MiB 16 GiB 3.8 TiB 57.10 1.74 96 osd.228
>>>>>> 230 hdd 8.90999 1.00000 8.9 TiB 4.9 TiB 4.9 TiB 144 MiB 16 GiB 4.0 TiB 55.20 1.69 99 osd.230
>>>>>> 234 hdd 8.90999 1.00000 8.9 TiB 5.6 TiB 5.6 TiB 164 MiB 18 GiB 3.3 TiB 63.29 1.93 109 osd.234
>>>>>> 237 hdd 8.90999 1.00000 8.9 TiB 4.8 TiB 4.8 TiB 110 MiB 15 GiB 4.1 TiB 54.33 1.66 97 osd.237
>>>>>> 260 hdd 10.91399 1.00000 11 TiB 5.4 TiB 5.4 TiB 152 MiB 17 GiB 5.5 TiB 49.35 1.51 104 osd.260
>>>>>> 0 hdd 8.90999 1.00000 8.9 TiB 5.2 TiB 5.2 TiB 157 MiB 16 GiB 3.7 TiB 58.28 1.78 102 osd.0
>>>>>> 2 hdd 8.90999 1.00000 8.9 TiB 5.3 TiB 5.2 TiB 122 MiB 16 GiB 3.6 TiB 59.05 1.80 106 osd.2
>>>>>> 72 hdd 8.90999 1.00000 8.9 TiB 4.4 TiB 4.4 TiB 145 MiB 14 GiB 4.5 TiB 49.89 1.52 89 osd.72
>>>>>> 76 hdd 8.90999 1.00000 8.9 TiB 5.1 TiB 5.1 TiB 178 MiB 16 GiB 3.8 TiB 56.89 1.74 102 osd.76
>>>>>> 86 hdd 8.90999 1.00000 8.9 TiB 4.6 TiB 4.5 TiB 155 MiB 14 GiB 4.3 TiB 51.18 1.56 94 osd.86
>>>>>> 1 hdd 8.90999 1.00000 8.9 TiB 4.9 TiB 4.9 TiB 141 MiB 15 GiB 4.0 TiB 54.73 1.67 95 osd.1
>>>>>> 3 hdd 8.90999 1.00000 8.9 TiB 4.7 TiB 4.7 TiB 156 MiB 15 GiB 4.2 TiB 52.40 1.60 94 osd.3
>>>>>> 73 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 4.9 TiB 146 MiB 16 GiB 3.9 TiB 55.68 1.70 102 osd.73
>>>>>> 85 hdd 8.90999 1.00000 8.9 TiB 5.6 TiB 5.5 TiB 192 MiB 18 GiB 3.3 TiB 62.46 1.91 109 osd.85
>>>>>> 87 hdd 8.90999 1.00000 8.9 TiB 5.0 TiB 5.0 TiB 189 MiB 16 GiB 3.9 TiB 55.91 1.71 102 osd.87
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io