It's a test cluster; each node has a single OSD and 4 GB RAM.

On Tue, Sep 10, 2019 at 3:42 PM Ashley Merrick <singapore@amerrick.co.uk> wrote:
What are the specs of the machines?

Recovery work will use more memory than normal clean operation, and it looks like you're maxing out the available memory on the machines while Ceph is trying to recover.
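If these are BlueStore OSDs, one thing worth trying (an untested sketch; the 1.5 GiB figure below is illustrative, not a tuned value) is lowering the OSD memory target so each daemon fits inside 4 GB of RAM, e.g. in ceph.conf on each OSD node:

[osd]
# default osd_memory_target is 4 GiB, which leaves no headroom on a 4 GB node
osd memory target = 1610612736

Then restart the OSDs. Throttling recovery (osd_max_backfills, osd_recovery_max_active) can also reduce the memory spike while the cluster heals.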



---- On Tue, 10 Sep 2019 18:10:50 +0800 amudhan83@gmail.com wrote ----

I have also found the errors below in dmesg.

[332884.028810] systemd-journald[6240]: Failed to parse kernel command line, ignoring: Cannot allocate memory
[332885.054147] systemd-journald[6240]: Out of memory.
[332894.844765] systemd[1]: systemd-journald.service: Main process exited, code=exited, status=1/FAILURE
[332897.199736] systemd[1]: systemd-journald.service: Failed with result 'exit-code'.
[332906.503076] systemd[1]: Failed to start Journal Service.
[332937.909198] systemd[1]: ceph-crash.service: Main process exited, code=exited, status=1/FAILURE
[332939.308341] systemd[1]: ceph-crash.service: Failed with result 'exit-code'.
[332949.545907] systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
[332949.546631] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 7.
[332949.546781] systemd[1]: Stopped Journal Service.
[332949.566402] systemd[1]: Starting Journal Service...
[332950.190332] systemd[1]: ceph-osd@1.service: Main process exited, code=killed, status=6/ABRT
[332950.190477] systemd[1]: ceph-osd@1.service: Failed with result 'signal'.
[332950.842297] systemd-journald[6249]: File /var/log/journal/8f2559099bf54865adc95e5340d04447/system.journal corrupted or uncleanly shut down, renaming and replacing.
[332951.019531] systemd[1]: Started Journal Service.
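For anyone reproducing this, a quick way to check whether the kernel OOM killer also fired (generic shell, nothing Ceph-specific):

dmesg -T | grep -iE 'out of memory|killed process'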

On Tue, Sep 10, 2019 at 3:04 PM Amudhan P <amudhan83@gmail.com> wrote:
Hi,

I am using Ceph version 13.2.6 (Mimic) on a test setup, trying out CephFS.

My current setup:
3 nodes; one node contains two OSDs and the other two nodes contain a single OSD each.

The volume is a 3-replica pool, and I am trying to simulate a node failure.
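(For reference, the filesystem was created roughly along these lines; the pool names and PG counts are illustrative, only the fs name cephfs-tst is from my cluster:

ceph osd pool create cephfs_data 32
ceph osd pool create cephfs_metadata 32
ceph fs new cephfs-tst cephfs_metadata cephfs_data
)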

I powered down one host and, when running any command on the other systems, started getting the message
"-bash: fork: Cannot allocate memory", and the systems stopped responding to commands.

What could be the reason for this?
At this stage I could still read some of the data stored in the volume, while other reads were just stuck waiting on I/O.
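In case it helps, these are the kinds of commands to see where the memory is going on a surviving node (osd.1 below is just an example id, assuming the admin socket is in the default location):

free -h
sudo ceph daemon osd.1 dump_mempools
ps -o rss,cmd -C ceph-osd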

output from "sudo ceph -s"
  cluster:
    id:     7c138e13-7b98-4309-b591-d4091a1742b4
    health: HEALTH_WARN
            1 osds down
            2 hosts (3 osds) down
            Degraded data redundancy: 5313488/7970232 objects degraded (66.667%), 64 pgs degraded

  services:
    mon: 1 daemons, quorum mon01
    mgr: mon01(active)
    mds: cephfs-tst-1/1/1 up  {0=mon01=up:active}
    osd: 4 osds: 1 up, 2 in

  data:
    pools:   2 pools, 64 pgs
    objects: 2.66 M objects, 206 GiB
    usage:   421 GiB used, 3.2 TiB / 3.6 TiB avail
    pgs:     5313488/7970232 objects degraded (66.667%)
             64 active+undersized+degraded

  io:
    client:   79 MiB/s rd, 24 op/s rd, 0 op/s wr
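(If I read the numbers right, the degraded count matches two of three replicas being offline: 2,656,744 objects x 3 replicas = 7,970,232 copies, and 2 x 2,656,744 = 5,313,488 of them are on the down OSDs, which is exactly 66.667%.)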

Output from "sudo ceph osd df":
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0   hdd 1.81940        0     0 B     0 B     0 B     0    0   0
 3   hdd 1.81940        0     0 B     0 B     0 B     0    0   0
 1   hdd 1.81940  1.00000 1.8 TiB 211 GiB 1.6 TiB 11.34 1.00   0
 2   hdd 1.81940  1.00000 1.8 TiB 210 GiB 1.6 TiB 11.28 1.00  64
                    TOTAL 3.6 TiB 421 GiB 3.2 TiB 11.31
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.03

Regards,
Amudhan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io