Hi *,
after updating our Ceph cluster from 14.2.9 to 14.2.10 it has been
accumulating scrub errors on multiple OSDs:
[cephmon1] /root # ceph health detail
HEALTH_ERR 6 scrub errors; Possible data damage: 6 pgs inconsistent
OSD_SCRUB_ERRORS 6 scrub errors
PG_DAMAGED Possible data damage: 6 pgs inconsistent
pg 3.69 is active+clean+inconsistent, acting [59,65,61]
pg 3.73 is active+clean+inconsistent, acting [73,88,25]
pg 12.29 is active+clean+inconsistent, acting [55,92,42]
pg 12.38 is active+clean+inconsistent, acting [150,42,13]
pg 12.46 is active+clean+inconsistent, acting [55,18,84]
pg 12.75 is active+clean+inconsistent, acting [55,155,49]
They can all be repaired easily (ceph pg repair $pg), but I wonder
what the source of the problem could be. The cluster started with
Luminous some years ago and was updated to Mimic, then Nautilus. We
have never seen this before!
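For reference, a sketch of how the PGs can be inspected and repaired
(the awk field index assumes the health detail output above; note
that list-inconsistent-obj may return nothing for pure stat
mismatches):

for pg in $(ceph health detail | awk '/active\+clean\+inconsistent/ {print $2}'); do
    # capture the scrub findings before repairing
    rados list-inconsistent-obj "$pg" --format=json-pretty
    ceph pg repair "$pg"
done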
OSDs are a mixture of HDDs and SSDs; both types are affected. All are
on BlueStore.
Any idea? Was there maybe a code change between 14.2.9 & 14.2.10 that
could explain this? Errors in syslog look like this:
Aug 5 19:21:21 krake08 ceph-osd: 2020-08-05 19:21:21.831 7fb6b2b9d700 -1 log_channel(cluster) log [ERR] : 12.38 scrub : stat mismatch, got 74/74 objects, 20/20 clones, 74/74 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 182904850/172877842 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
Aug 5 19:21:21 krake08 ceph-osd: 2020-08-05 19:21:21.831 7fb6b2b9d700 -1 log_channel(cluster) log [ERR] : 12.38 scrub 1 errors
Aug 6 08:28:44 krake08 ceph-osd: 2020-08-06 08:28:44.477 7fb6b2b9d700 -1 log_channel(cluster) log [ERR] : 12.38 repair : stat mismatch, got 76/76 objects, 22/22 clones, 76/76 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 183166994/173139986 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
Aug 6 08:28:44 krake08 ceph-osd: 2020-08-06 08:28:44.477 7fb6b2b9d700 -1 log_channel(cluster) log [ERR] : 12.38 repair 1 errors, 1 fixed
Thanks in advance,
Andreas
--
| Andreas Haupt | E-Mail: andreas.haupt(a)desy.de
| DESY Zeuthen | WWW: http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6 | Phone: +49/33762/7-7359
| D-15738 Zeuthen | Fax: +49/33762/7-7216
Hi,
I've read that Ceph has some InfluxDB reporting capabilities built in
(https://docs.ceph.com/docs/master/mgr/influx/).
However, Telegraf, which is the system reporting daemon for InfluxDB, also has a Ceph plugin (https://github.com/influxdata/telegraf/tree/master/plugins/inputs/ceph).
Just curious what people's thoughts on the two are, or what they are using in production?
Which is easier to deploy/maintain, have you found? Or more useful for alerting, or tracking performance gremlins?
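For reference, my understanding from the linked docs is that the
built-in module is enabled roughly like this (untested sketch; the
config key names come from the docs page and the hostname/credentials
are placeholders):

ceph mgr module enable influx
ceph config set mgr mgr/influx/hostname influxdb.example.com
ceph config set mgr mgr/influx/username ceph
ceph config set mgr mgr/influx/password secret
ceph config set mgr mgr/influx/database ceph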
Thanks,
Victor
hey folks,
I was deploying a new set of NVMe cards into my cluster, and while
getting the new devices ready the device names got mixed up, and I
managed to run "sgdisk --zap-all" and "dd if=/dev/zero of="/dev/sd"
bs=1M count=100" on some of the active devices.
I was adding the new cards so I could migrate off the 2k+2m
erasure-coded setup to a more redundant config, but in the mix-up I
ran the commands above on 3 of the 4 devices before the ceph status
changed and I noticed the mistake.
I managed to restore the LVM partition table from a backup, but that
does not seem to be enough to restart the OSD... I just need to
recover one of the 3 drives to save the filesystem backing all of my
VM + Docker data.
I'm running on Kubernetes with Rook, after restoring the partition table it
seems to be starting up ok, but then I get a stack trace and the container
goes into Error state: https://pastebin.com/5wk1bKy9
Any ideas how to fix this? Or somehow extract the data and put it back
together?
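For reference, the kind of offline inspection/export I am hoping is
still possible looks roughly like this (a sketch only; the OSD path
and PG id are placeholders for a standard non-Rook layout, and the
OSD has to be stopped):

# check whether the BlueStore metadata is still readable
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12

# list the PGs held by the offline OSD, then export one to a file
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --pgid 2.1f --op export --file /tmp/2.1f.export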
--
Cheers,
Peter Sarossy
Technical Program Manager
Data Center Data Security - Google LLC.
Hello,
with cron I run backups with backurne
(https://github.com/JackSlateur/backurne), which is RBD-based.
Sometimes I get these messages:
2020-08-05T18:42:18.915+0200 7fcdbd7fa700 -1 librbd::ImageWatcher:
0x7fcda400a6a0 image watch failed: 140521330717776, (107) Transport
endpoint is not connected
2020-08-05T18:42:18.915+0200 7fcdbd7fa700 -1 librbd::Watcher:
0x7fcda400a6a0 handle_error: handle=140521330717776: (107) Transport
endpoint is not connected
2020-08-05T18:45:13.496+0200 7fcdbd7fa700 -1 librbd::ImageWatcher:
0x7fcda400a6a0 image watch failed: 140521330717776, (110) Connection
timed out
2020-08-05T18:45:13.496+0200 7fcdbd7fa700 -1 librbd::Watcher:
0x7fcda400a6a0 handle_error: handle=140521330717776: (110) Connection
timed out
2020-08-05T18:58:01.408+0200 7fc66ffff700 -1 librbd::ImageWatcher:
0x7fc65c00a770 image watch failed: 140490057987152, (107) Transport
endpoint is not connected
2020-08-05T18:58:01.408+0200 7fc66ffff700 -1 librbd::Watcher:
0x7fc65c00a770 handle_error: handle=140490057987152: (107) Transport
endpoint is not connected
Any idea what this indicates and how it could be solved?
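For reference, a quick way to check whether a stale watch or a
blacklisted client is involved (a sketch; pool and image names are
placeholders, and on newer releases the last command is spelled
"blocklist"):

# show current watchers on the image
rbd status mypool/myimage

# check whether the backup client ended up on the OSD blacklist
ceph osd blacklist ls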
Thanks,
Michael
I have an RGW bucket (backups) that is versioned. A nightly job creates
a new version of a few objects. There is a lifecycle policy (see below)
that keeps 18 days of versions. This has been working perfectly and has
not been changed. Until I upgraded Octopus...
The nightly job creates separate log files, including a listing of the
object versions. From these I can see that:
13/7 02:14 versions from 13/7 01:13 back to 24/6 01:17 (correct)
14/7 02:14 versions from 14/7 01:13 back to 25/6 01:14 (correct)
14/7 10:00 upgrade Octopus 15.2.3 -> 15.2.4
15/7 02:14 versions from 15/7 01:13 back to 25/6 01:14 (would have
expected 25/6 to have expired)
16/7 02:14 versions from 16/7 01:13 back to 15/7 01:13 (now all
pre-upgrade versions have wrongly disappeared)
It's not a big deal for me, as they are only backups, provided it
continues to work correctly from now on. However, it may affect some
other people much more.
Any ideas on the root cause? And if it is likely to be stable again now?
Thanks, Chris
{
    "Rules": [
        {
            "Expiration": {
                "ExpiredObjectDeleteMarker": true
            },
            "ID": "Expiration & incomplete uploads",
            "Prefix": "",
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {
                "NoncurrentDays": 18
            },
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": 1
            }
        }
    ]
}
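For completeness, the lifecycle state on the RGW side can be listed
and a run forced like this (a sketch; I have not yet checked whether
the output explains the behaviour above):

# per-bucket lifecycle processing status
radosgw-admin lc list

# trigger a lifecycle run manually
radosgw-admin lc process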
I am having trouble getting rid of an error after creating a new Ceph
cluster. The error is:
Module 'cephadm' has failed: auth get failed: failed to find
client.crash.ceph-0 in keyring retval: -2
Checking the keyrings (actual keys redacted as keyA/keyB):
# ceph auth ls
...
client.crash.ceph-0.data.igb.illinois.edu
key: keyA
caps: [mgr] profile crash
caps: [mon] profile crash
...
mgr.ceph-0.uymvya
key: keyB
caps: [mds] allow *
caps: [mon] profile mgr
caps: [osd] allow *
...
Then on the node:
# cat mgr.ceph-0.uymvya/keyring
[mgr.ceph-0.uymvya]
key = keyB
caps mds = "allow *"
caps mon = "profile mgr"
caps osd = "allow *"
So the keys match where they should, but I am thinking that the
missing domain on the mgr name might be the issue. This is keeping me
from getting RADOS working, since the cluster is showing a health
issue. Is this perhaps a known issue in my version of Ceph (15.2.4)?
Can someone point me to the fix, or at minimum tell me how to remove
the mgr on ceph-0 from the config, so that I can get the system
healthy?
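For reference, creating a key under the exact name cephadm is
querying (with the same caps as the existing crash key above), and
removing the daemon via the orchestrator, would look roughly like
this (untested sketch):

# create the key under the name cephadm asks for
ceph auth get-or-create client.crash.ceph-0 mon 'profile crash' mgr 'profile crash'

# or remove the mgr daemon on ceph-0, as asked above
ceph orch daemon rm mgr.ceph-0.uymvya --force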
thanks,
Dan
Folks,
we’re building a Ceph cluster based on HDDs with SSDs for WAL/DB files. We have four nodes with 8TB
disks and two SSDs and four nodes with many small HDDs (1.4-2.7TB) and four SSDs for the journals.
HDDs are configured as RAID 0 on the controllers with writethrough
enabled. I am writing this e-mail as we see absolutely catastrophic
performance on the cluster (anything between literally no throughput
for seconds and 200 MB/s, wildly varying).
We've checked every single layer: the network is far from being
saturated (we have 25 Gbit/s uplinks and we can confirm that they
deliver 25 Gbit/s). Using iostat, we can see that during a "rados
bench" run neither the SSDs nor the actual hard disks come anywhere
near 100% disk utilization; usage usually does not exceed 55%.
Servers are Dell RX740d with PERC onboard controllers.
We also have four SSD-only nodes. When benchmarking against a pool on
these, I reliably get 400-500 MB/s when doing the same 64k test that
I ran against the HDD pool.
We've tried a number of things, such as enabling the BBWC on the RAID
controllers, without major success. "ceph tell osd.X bench" shows
55-100 IOPS for HDDs even though their journals are on SSDs. We have
also tried disabling the SSDs' write cache (the ones holding the
journals), to no success.
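For completeness, these are roughly the benchmark invocations behind
the numbers above (the pool name and OSD id are placeholders):

# 60s write benchmark with 64k objects against the HDD-backed pool
rados -p hddpool bench 60 write -b 65536 -t 16

# per-OSD benchmark as quoted above
ceph tell osd.12 bench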
Any pointer to what we may have overlooked would be greatly
appreciated.
Best regards
Martin