Ceph

Where to find docs

The official Ceph troubleshooting documentation is available at https://docs.ceph.com/en/latest/rados/troubleshooting/

It is strongly advised to use the documentation that matches the Ceph release running on the cluster.

Pacific - https://docs.ceph.com/en/pacific/rados/troubleshooting/
Quincy - https://docs.ceph.com/en/quincy/rados/troubleshooting/
Reef - https://docs.ceph.com/en/reef/rados/troubleshooting/

Critical medium error

In this example the block device sdf reports unrecovered read errors. They are visible in the kernel ring buffer, for example.

$ sudo dmesg
[...]
[14062414.575715] sd 14:0:5:0: [sdf] tag#2120 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[14062414.575722] sd 14:0:5:0: [sdf] tag#2120 Sense Key : Medium Error [current] [descriptor]
[14062414.575725] sd 14:0:5:0: [sdf] tag#2120 Add. Sense: Unrecovered read error
[14062414.575728] sd 14:0:5:0: [sdf] tag#2120 CDB: Read(16) 88 00 00 00 00 01 09 7c d9 50 00 00 00 80 00 00
[14062414.575730] critical medium error, dev sdf, sector 4454144360 op 0x0:(READ) flags 0x0 phys_seg 13 prio class 2
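
To reduce a long kernel log to the affected devices, the output can be filtered. The following is a minimal sketch, not an official tool. The function name `medium_error_devices` is hypothetical, and the pattern assumes the "critical medium error, dev ..." message format shown above.

```shell
# Hypothetical helper: read kernel log lines on stdin and print each device
# that reported a "critical medium error", once per device.
medium_error_devices() {
  sed -n 's/.*critical medium error, dev \([^,]*\),.*/\1/p' | sort -u
}

# Usage on a storage node (assumes the message format shown above):
#   sudo dmesg | medium_error_devices
```

Reading from stdin keeps the helper independent of the log source, so saved log files can be checked the same way.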

The problem may also show up in the Ceph health details.

$ ceph -s
[...]
    health: HEALTH_WARN
            Too many repaired reads on 1 OSDs
[...]

$ ceph health detail
HEALTH_WARN Too many repaired reads on 1 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
    osd.17 had 13 reads repaired
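
The affected OSD IDs can be pulled out of the health details for use in later steps. This is a minimal sketch under the assumption that the detail lines look like the one above ("osd.N had M reads ..."). The function name `osds_with_repaired_reads` is hypothetical.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# the OSDs that are listed with repaired reads, one per line.
osds_with_repaired_reads() {
  sed -n 's/^[[:space:]]*\(osd\.[0-9][0-9]*\) had [0-9][0-9]* reads.*/\1/p'
}

# Usage:
#   ceph health detail | osds_with_repaired_reads
```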

In this case the block device sdf is in the storage node sto1001. The OSD that uses this block device can be looked up in the device list.

$ ceph device ls | grep 'sto1001:sdf'
SEAGATE_ST16000NM004J_ZR604ZDZ0000C210PWE9  sto1001:sdf      osd.17

If you only know the OSD ID, you can also determine the associated block device and the storage node.

$ ceph device ls | grep -w 'osd\.17'
[...]
SEAGATE_ST16000NM004J_ZR604ZDZ0000C210PWE9  sto1001:sdf      osd.17
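
Both lookups can also be done with exact column matching. This avoids a plain substring search for osd.17 also matching, for example, osd.170. The sketch below assumes the column layout of the example output (device ID, HOST:DEV, daemon) with a single daemon per device. The function names are hypothetical.

```shell
# Hypothetical helpers that read `ceph device ls` output on stdin.
# Assumption: column 2 is HOST:DEV and column 3 is the daemon, as in the
# example output, with a single daemon per device.

# Print the daemon for an exact HOST:DEV value such as sto1001:sdf.
osd_for_device() {
  awk -v loc="$1" '$2 == loc { print $3 }'
}

# Print the HOST:DEV value for an exact daemon name such as osd.17.
device_for_osd() {
  awk -v osd="$1" '$3 == osd { print $2 }'
}

# Usage:
#   ceph device ls | osd_for_device sto1001:sdf
#   ceph device ls | device_for_osd osd.17
```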

Mark the OSD on the failing device as out. Ceph then rebalances its data onto the remaining OSDs. This can take some time and cause a high level of activity on the Ceph cluster.

$ ceph osd out osd.17
marked out osd.17.
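
Because rebalancing takes time, it can help to wait until Ceph reports that the OSD no longer holds data that is needed. The following sketch polls `ceph osd safe-to-destroy`, which exits with a non-zero status while the OSD is still required. The function name `wait_until_safe_to_destroy` and the 60 second default interval are assumptions made for illustration.

```shell
# Hypothetical helper: block until `ceph osd safe-to-destroy` succeeds for
# the given OSD. The second argument is the polling interval in seconds
# (assumed default: 60).
wait_until_safe_to_destroy() {
  osd="$1"
  interval="${2:-60}"
  until ceph osd safe-to-destroy "$osd" >/dev/null 2>&1; do
    sleep "$interval"
  done
  echo "$osd is safe to destroy"
}

# Usage:
#   wait_until_safe_to_destroy osd.17
```

The loop has no timeout, so in practice it is best run interactively while watching the cluster status with ceph -s.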

On the storage node, stop the systemd service of the OSD.

$ sudo systemctl stop ceph-osd@17.service
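
When the OSD ID comes from an earlier lookup, the unit name can be derived from it instead of being typed by hand. This is a small sketch that assumes the ceph-osd@ID.service naming used above. The function name `osd_unit` is hypothetical.

```shell
# Hypothetical helper: map an OSD name such as osd.17 (or a bare ID such
# as 17) to the systemd unit name used above.
osd_unit() {
  printf 'ceph-osd@%s.service\n' "${1#osd.}"
}

# Usage on the storage node:
#   sudo systemctl stop "$(osd_unit osd.17)"
#   systemctl is-active "$(osd_unit osd.17)"   # check whether it still runs
```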
