Migrate from Ceph-Ansible to Cephadm

warning

The migration from ceph-ansible to cephadm is performed in place by adopting existing Ceph daemons one at a time. While this process is designed to be non-disruptive, it is strongly advised to test the migration in a controlled environment first, such as the OSISM testbed. Ensure that precautionary backups are made and all other necessary safety measures are in place before migrating a production cluster.

Known limitations

This guide is a work in progress. The following areas are not yet covered or tested:

  • Multi-site RGW: Only single-site RGW deployments have been tested. Multi-site migration instructions will be added in a future update.
  • Backup and safety measures: Specific guidance on recommended backup strategies and concrete pre-migration safety measures is still being prepared.
  • Automated readiness checks: A planned osism apply ready-for-cephadm task to automate prerequisite verification is not yet available.

This note will be updated as additional sections are completed.

Background

The deployment tool ceph-ansible is deprecated as of OSISM 10 and will not be supported in upcoming OSISM releases. The official recommendation from the Ceph project is to migrate to cephadm.

After migration, Ceph daemons run as containers managed by cephadm instead of ceph-ansible. All lifecycle operations (upgrades, expansions, configuration changes) are then performed through cephadm and the Ceph orchestrator.

For the full upstream documentation, refer to Switching from ceph-ansible to cephadm.

Prerequisites

  • A running Ceph cluster deployed with ceph-ansible via OSISM.
  • All Ceph daemons are healthy (ceph -s reports HEALTH_OK or only expected warnings).
  • SSH access from the OSISM manager node to all Ceph nodes (required by cephadm for orchestration).
  • The Ceph cluster must be running Ceph Octopus (15.2.x) or later. Clusters on OSISM 7 or later already meet this requirement.
  • Python 3 and lvm2 must be installed on all Ceph nodes (these are typically already present).

TODO: Consider replacing the manual prerequisite checks with an osism apply ready-for-cephadm task that automatically verifies all conditions (cluster health, SSH access, Ceph version, required packages).

Step 1: Verify cluster health

Before starting the migration, ensure the cluster is in a healthy state. Run the following commands on the OSISM manager node.

ceph -s
ceph osd tree
ceph df

All PGs should be active+clean. Resolve any degraded or misplaced PGs before proceeding. In ceph osd tree, verify that all OSDs are up and have a non-zero REWEIGHT. In ceph df, check that %RAW USED is well below 85% (Ceph's default nearfull threshold).

Step 2: Install cephadm

Install cephadm on all Ceph nodes. The cephadm release series (Quincy, Reef, Squid) must match the running Ceph release.

First, determine the running Ceph release on the OSISM manager node:

ceph version

This returns output like ceph version 18.2.7 (...) reef (stable). The full version number (e.g. 18.2.7) is needed for the next step.

Control nodes (mon, mgr)

note

OSD nodes require a different installation method due to a known bug in the UCA cephadm package for Reef. See the OSD nodes section below before installing on OSD nodes.

Install the cephadm package from the Ubuntu Cloud Archive on the control nodes:

Ceph Release   cephadm Package
Quincy (17)    cephadm_17.2.9-0ubuntu0.22.04.1~cloud0_amd64.deb
Reef (18)      cephadm_18.2.4-0ubuntu1~cloud1_amd64.deb
Squid (19)     cephadm_19.2.3-0ubuntu0.24.04.2~cloud0_amd64.deb

These are the only versions available in the UCA. The exact point release does not need to match your running Ceph version -- only the release series (Quincy, Reef, Squid) matters.

The package is installed directly via dpkg rather than apt install because the Ubuntu Cloud Archive is typically not configured as an apt source in OSISM environments. Adding the full UCA repository just for cephadm is not recommended, as it could introduce unintended package upgrades from the UCA that conflict with the versions managed by OSISM.

CEPHADM_PKG=cephadm_18.2.4-0ubuntu1~cloud1_amd64.deb
curl --silent --remote-name --location https://ubuntu-cloud.archive.canonical.com/ubuntu/pool/main/c/ceph/${CEPHADM_PKG}
sudo dpkg -i ${CEPHADM_PKG}
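
The package choice can also be derived from the running release instead of being hard-coded. The following is a minimal sketch, not part of the official procedure: the function names ceph_release_series and uca_cephadm_pkg are illustrative, the package names are copied from the table above, and the parsing assumes the ceph version output has the shape shown earlier in this step.

```shell
# Sketch: pick the UCA cephadm package for the running release series.
# Function names are illustrative; package names come from the table above.

# Extract the release name (e.g. "reef") from a `ceph version` output line.
ceph_release_series() {
    # Assumed shape: ceph version 18.2.7 (<hash>) reef (stable)
    printf '%s\n' "$1" | awk '{print $5}'
}

# Map a release series to the matching UCA package file name.
uca_cephadm_pkg() {
    case "$1" in
        quincy) echo "cephadm_17.2.9-0ubuntu0.22.04.1~cloud0_amd64.deb" ;;
        reef)   echo "cephadm_18.2.4-0ubuntu1~cloud1_amd64.deb" ;;
        squid)  echo "cephadm_19.2.3-0ubuntu0.24.04.2~cloud0_amd64.deb" ;;
        *)      echo "no UCA cephadm package listed for release: $1" >&2
                return 1 ;;
    esac
}

# Example using the sample output quoted earlier in this step.
series=$(ceph_release_series "ceph version 18.2.7 (...) reef (stable)")
CEPHADM_PKG=$(uca_cephadm_pkg "${series}")
echo "${CEPHADM_PKG}"
```

On a real system the first argument would be the output of ceph version on the OSISM manager node. An unknown series makes the function fail instead of guessing a package.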

OSD nodes

If you are running Quincy or Squid, install cephadm on OSD nodes using the same UCA package as described for control nodes above.

If you are running Reef (18.x), the UCA cephadm package (18.2.4) contains a known bug that causes a crash when parsing AppArmor profiles during OSD adoption. The bug was fixed upstream and the fix is included in Reef 18.2.5 and later, but the UCA package has not been updated beyond 18.2.4.

On Reef OSD nodes, install cephadm as a standalone Python script from the Ceph Git repository instead to get a version that includes the fix:

CEPH_RELEASE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}' | cut -d: -f2)
curl --silent --remote-name --location https://raw.githubusercontent.com/ceph/ceph/${CEPH_RELEASE}/src/cephadm/cephadm.py
chmod +x cephadm.py
sudo mv cephadm.py /usr/sbin/cephadm

Step 3: Prepare the cephadm configuration

On each Ceph node, prepare the host for cephadm. The cephadm prepare-host command performs a series of checks to ensure the host meets the requirements for managing Ceph daemons with cephadm. Specifically, it verifies:

  • A container runtime (Podman or Docker) is installed and functional
  • LVM2 (lvm2 package) is available — required by Ceph OSDs for managing logical volumes
  • Time synchronization (e.g. chrony or NTP) is enabled and running — clock skew between Ceph nodes can cause monitors to lose quorum
  • General system prerequisites such as the availability of systemctl

If any of these checks fail, prepare-host will report the issue so it can be resolved before proceeding with the migration.

sudo cephadm prepare-host

The output should look similar to this:

Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chrony.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Host looks OK

If the output indicates errors or missing dependencies, resolve them before proceeding.
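
When preparing many nodes, the result can be checked mechanically rather than by eye. This sketch assumes that a successful run contains the line "Host looks OK", as in the sample output above. The helper name host_looks_ok is hypothetical.

```shell
# Succeeds only if the captured prepare-host output contains the
# line "Host looks OK" (taken from the sample output above).
host_looks_ok() {
    printf '%s\n' "$1" | grep -qx 'Host looks OK'
}

# Hypothetical usage on a Ceph node:
#   output=$(sudo cephadm prepare-host 2>&1)
#   host_looks_ok "${output}" || echo "prepare-host reported problems" >&2

# Self-contained demonstration with a shortened copy of the sample output.
sample='Verifying time synchronization is in place...
Unit chrony.service is enabled and running
Host looks OK'
if host_looks_ok "${sample}"; then
    echo "host ready"
fi
```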

Determine the currently used container image and set it in the Ceph configuration. Run the following commands on the OSISM manager node:

MON_NODE=$(osism get hosts -l ceph-mon | awk 'NR>3 && /\|/ {print $2}' | head -1)
CEPH_IMAGE=$(ssh ${MON_NODE} "docker inspect \$(docker ps --filter 'name=ceph' --format '{{.Names}}' | head -1) --format '{{.Config.Image}}'")
echo "Container image: ${CEPH_IMAGE}"
ceph config set global container_image ${CEPH_IMAGE}

warning

Do not run the docker inspect command on the manager node — it runs the cephclient container there and would return a wrong image (cephclient:reef instead of ceph-daemon:reef).
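
A small guard can catch this mistake before the image is written to the cluster configuration. The sketch below only assumes the two image names mentioned in the warning (ceph-daemon and cephclient). The helper name is_ceph_daemon_image is illustrative.

```shell
# Succeeds if the image reference points at a ceph-daemon image,
# fails for anything else (for example the cephclient image).
is_ceph_daemon_image() {
    case "$1" in
        */ceph-daemon:*|ceph-daemon:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration with the two image names mentioned in the warning.
for image in registry.osism.tech/osism/ceph-daemon:reef \
             registry.osism.tech/osism/cephclient:reef; do
    if is_ceph_daemon_image "${image}"; then
        echo "ok: ${image}"
    else
        echo "wrong image: ${image}"
    fi
done
```

In practice the check would be applied to ${CEPH_IMAGE} before running ceph config set global container_image.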

Import the existing ceph.conf into the central monitor config store. This ensures that custom tuning parameters are preserved after migration, as cephadm uses the centralized config store instead of per-node ceph.conf files. Since the ceph CLI is not installed on the host, execute the command inside the crash container on one of the monitor nodes:

docker exec ceph-crash-$(hostname) ceph config assimilate-conf -i /etc/ceph/ceph.conf

In a typical OSISM deployment, the ceph.conf is identical across all nodes, so running this once is sufficient. If nodes have individual tuning parameters in their ceph.conf, run the command on each affected node.

Step 4: Configure cephadm

Enable the cephadm orchestrator module on the OSISM manager node:

ceph mgr module enable cephadm
ceph orch set backend cephadm

Configure cephadm to use the dragon user (which has passwordless sudo) instead of root. In a standard OSISM deployment, the operator SSH key already exists at /opt/ansible/secrets/id_rsa.operator. Import it so that cephadm can connect to all Ceph nodes. On the OSISM manager node:

ceph cephadm set-user dragon
cp /opt/ansible/secrets/id_rsa.operator* /opt/cephclient/data/
ceph cephadm set-priv-key -i /data/id_rsa.operator
ceph cephadm set-pub-key -i /data/id_rsa.operator.pub
rm /opt/cephclient/data/id_rsa.operator*

info

If no existing SSH key is available, generate a new one and distribute it to all Ceph nodes instead:

ceph cephadm generate-key
ceph cephadm get-pub-key > /tmp/cephadm-pub-key.pub

Copy the public key to all Ceph nodes:

ssh-copy-id -f -i /tmp/cephadm-pub-key.pub dragon@<node>

Or use a loop to distribute the key to all Ceph nodes at once:

for node in $(osism get hosts -l ceph | awk 'NR>3 && /\|/ {print $2}'); do
    ssh-copy-id -f -i /tmp/cephadm-pub-key.pub dragon@${node}
done

Register all Ceph nodes with the orchestrator on the OSISM manager node:

ceph orch host add <node> <node-ip>

Or use a loop to register all Ceph nodes at once:

for node in $(osism get hosts -l ceph | awk 'NR>3 && /\|/ {print $2}'); do
    ceph orch host add ${node} $(getent hosts ${node} | awk '{print $1}')
done

The output should look similar to this:

Added host 'testbed-node-0' with addr '192.168.16.10'
Added host 'testbed-node-1' with addr '192.168.16.11'
Added host 'testbed-node-2' with addr '192.168.16.12'
Added host 'testbed-node-3' with addr '192.168.16.13'
Added host 'testbed-node-4' with addr '192.168.16.14'
Added host 'testbed-node-5' with addr '192.168.16.15'

Verify that all hosts have been registered:

ceph orch host ls

The output should look similar to this:

HOST ADDR LABELS STATUS
testbed-node-0 192.168.16.10
testbed-node-1 192.168.16.11
testbed-node-2 192.168.16.12
testbed-node-3 192.168.16.13
testbed-node-4 192.168.16.14
testbed-node-5 192.168.16.15
6 hosts in cluster
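
Several loops in this guide extract host names with awk 'NR>3 && /\|/ {print $2}'. The filter skips the first three lines (assumed to be table border and header) and then prints the second whitespace-separated field of every line that contains a pipe character. The sketch below shows the filter on a hypothetical table. The exact layout printed by osism get hosts is an assumption inferred from the filter and may differ.

```shell
# Extract host names from a bordered table, using the same awk filter
# as the loops in this guide.
hosts_from_table() {
    awk 'NR>3 && /\|/ {print $2}'
}

# Hypothetical sample; the real `osism get hosts` layout may differ.
sample_table='+----------------+
| Host           |
|----------------|
| testbed-node-0 |
| testbed-node-1 |
+----------------+'

# Prints one host name per line; the closing border has no pipe
# character and is therefore ignored.
printf '%s\n' "${sample_table}" | hosts_from_table
```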

Step 5: Adopt daemons

Adopt all Ceph daemons across the cluster. The recommended order is:

  1. Monitors (mon)
  2. Managers (mgr)
  3. OSDs (osd)

For each daemon, run the adopt command on the respective node.

warning

Verify cluster health with ceph -s on the OSISM manager node after each step. Do not proceed if ceph -s reports HEALTH_ERR or degraded/unavailable PGs.

During migration, HEALTH_WARN with messages like "stray daemon(s) not managed by cephadm" or "failed to probe daemons or devices" is expected and safe to continue.

Adopting monitor daemons

The monitor daemons (MON) maintain the cluster map and are responsible for consensus among the Ceph nodes. The adopt command converts each monitor from its legacy systemd/Docker-based deployment (as set up by ceph-ansible) to a cephadm-managed container. During this process cephadm will:

  1. Pull the container image (if not already present)
  2. Stop and disable the old systemd unit (e.g. ceph-mon@<hostname>)
  3. Move the monitor's data directory and logs into the cephadm-managed directory layout under /var/lib/ceph/<fsid>/
  4. Create new systemd units managed by cephadm

The monitor service remains available throughout: the other monitors maintain quorum while one is being adopted.

On each monitor node, set the required variables and adopt the daemon:

CEPH_HOSTNAME=$(hostname)
CEPH_IMAGE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}')
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name mon.${CEPH_HOSTNAME}
sudo systemctl reset-failed ceph-mon@${CEPH_HOSTNAME}.service 2>/dev/null || true

The output should look similar to this:

Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
Stopping old systemd unit ceph-mon@testbed-node-0...
Disabling old systemd unit ceph-mon@testbed-node-0...
Moving data...
Chowning content...
Moving logs...
Creating new units...

Adopting manager daemons

The manager daemons (MGR) provide additional monitoring and management interfaces for the cluster (e.g. the dashboard, Prometheus metrics, and the orchestrator module). The adopt process is identical to that of the monitors — cephadm stops the legacy unit, migrates data and logs, and creates new cephadm-managed systemd units. Since multiple managers run in active/standby mode, adopting one at a time ensures the cluster always has an active manager available.

On each manager node, set the required variables and adopt the daemon:

CEPH_HOSTNAME=$(hostname)
CEPH_IMAGE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}')
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name mgr.${CEPH_HOSTNAME}
sudo systemctl reset-failed ceph-mgr@${CEPH_HOSTNAME}.service 2>/dev/null || true

The output should look similar to this:

Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
Stopping old systemd unit ceph-mgr@testbed-node-0...
Disabling old systemd unit ceph-mgr@testbed-node-0...
Moving data...
Chowning content...
Moving logs...
Creating new units...

Verify monitors and managers

After all monitors and managers have been adopted, verify on the OSISM manager node that the orchestrator recognises them.

List all services known to the orchestrator:

ceph orch ls --refresh

The output should look similar to this. Both services show as <unmanaged> at this point, which is expected:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mgr 3/0 43s ago - <unmanaged>
mon 3/0 43s ago - <unmanaged>

List all monitor daemon instances and their status:

ceph orch ps --daemon-type mon

The output should look similar to this. All monitors should show as running:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mon.testbed-node-0 testbed-node-0 running (22m) 2m ago 13m 73.6M 2048M 18.2.8 01985efead8e e9f0ac0ce245
mon.testbed-node-1 testbed-node-1 running (19m) 2m ago 13m 69.5M 2048M 18.2.8 01985efead8e aa12850676f7
mon.testbed-node-2 testbed-node-2 running (19m) 2m ago 13m 64.5M 2048M 18.2.8 01985efead8e 43a13bac74fb

List all manager daemon instances and their status:

ceph orch ps --daemon-type mgr

The output should look similar to this. All managers should show as running:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mgr.testbed-node-0 testbed-node-0 running (3m) 2m ago 2m 459M - 18.2.8 01985efead8e b6c5b884b38c
mgr.testbed-node-1 testbed-node-1 running (2m) 2m ago - 458M - 18.2.8 01985efead8e 8adf9e898e82
mgr.testbed-node-2 testbed-node-2 running (2m) 2m ago - 504M - 18.2.8 01985efead8e 54e779780c5a

Adopting OSD daemons

The OSD daemons (Object Storage Daemon) are responsible for storing the actual data on disk. Adopting OSDs is the most sensitive part of the migration because each OSD manages real data volumes (BlueStore). The adopt process migrates each OSD's data directory, block device symlinks, and logs into the cephadm layout — but the underlying data on disk is not moved or modified.

Because an OSD restart temporarily reduces the number of available replicas, safety flags (noout, nodeep-scrub, balancer off) are set beforehand to prevent Ceph from initiating unnecessary data rebalancing or deep scrubs while OSDs are being restarted during adoption. The PG autoscaler is also disabled to avoid placement group changes during the process.

Set the safety flags on the OSISM manager node before adopting any OSD:

ceph osd set noout
ceph osd set nodeep-scrub
ceph balancer off

Disable the PG autoscaler on all pools that have it enabled. Record which pools had it enabled so it can be re-enabled after adoption:

for pool in $(ceph osd pool ls); do
    mode=$(ceph osd pool get ${pool} pg_autoscale_mode -f json | python3 -c "import json,sys; print(json.load(sys.stdin)['pg_autoscale_mode'])")
    if [ "${mode}" = "on" ]; then
        echo "${pool}" >> /home/dragon/autoscale_pools.txt
        ceph osd pool set ${pool} pg_autoscale_mode off
    fi
done
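
Because the loop appends to the record file, running it a second time could list a pool twice. One way to keep the record free of duplicates is sketched below. The helper name record_pool and the pool names are illustrative, and the demonstration writes to a temporary file rather than /home/dragon/autoscale_pools.txt.

```shell
# Append a pool name to the record file only if it is not already listed.
record_pool() {
    pool=$1
    file=$2
    touch "${file}"
    # -x matches whole lines, -F treats the pool name as a fixed string.
    grep -qxF -- "${pool}" "${file}" || printf '%s\n' "${pool}" >> "${file}"
}

# Demonstration against a temporary file.
record_file=$(mktemp)
record_pool volumes "${record_file}"
record_pool images "${record_file}"
record_pool volumes "${record_file}"   # already recorded, so nothing is added
cat "${record_file}"
```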

On each OSD node, set the required variable and identify the OSDs running on the node:

CEPH_IMAGE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}')
docker ps --filter "name=ceph-osd"

The output should look similar to this:

CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES
05eb59c1ef36 registry.osism.tech/osism/ceph-daemon:reef "/usr/bin/ceph-osd …" 5 days ago Up 5 days ceph-osd-3
8646edf83163 registry.osism.tech/osism/ceph-daemon:reef "/usr/bin/ceph-osd …" 5 days ago Up 5 days ceph-osd-1

The OSD ID is the number after ceph-osd- in the container name. For example, ceph-osd-1 has OSD ID 1 and ceph-osd-3 has OSD ID 3.
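
The name-to-ID rule can be written as a filter that only accepts names of the exact form ceph-osd-<number>. This is a sketch under that naming assumption. The helper name osd_ids_from_names is illustrative.

```shell
# Read container names on stdin and print the numeric OSD IDs, sorted.
# Names that do not match ceph-osd-<number> exactly are dropped.
osd_ids_from_names() {
    sed -n 's/^ceph-osd-\([0-9][0-9]*\)$/\1/p' | sort -n
}

# Demonstration with the container names from the sample output plus
# one unrelated container name that is filtered out.
printf '%s\n' ceph-osd-3 ceph-osd-1 ceph-crash-testbed-node-4 | osd_ids_from_names
```

On an OSD node the input would come from docker ps --filter "name=ceph-osd" --format "{{.Names}}".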

warning

Adopt OSDs one node at a time. After completing all OSDs on a node, wait until all PGs return to active+clean (ceph -s shows all PGs active+clean under data:) before proceeding to the next node. The overall health may still show HEALTH_WARN for stray daemons during migration — this is expected and not a stop condition.

Within a single node, it is safe to use the loop below without waiting between individual OSDs, provided your cluster meets both of the following conditions:

  • At least 3 OSD nodes.
  • Host-level CRUSH failure domain — the OSISM default. Verify with:
    ceph osd crush rule dump
    All rules should show "type": "host" in the chooseleaf step.

If your cluster has only 2 OSD nodes, or any rule shows a different failure domain type, adopt each OSD individually and wait for all PGs to return to active+clean after each one before continuing.

Then adopt each OSD on the node:

OSD_ID=<osd_id>
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name osd.${OSD_ID}
sudo systemctl reset-failed ceph-osd@${OSD_ID}.service 2>/dev/null || true

The output should look similar to this:

Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
Found online OSD at //var/lib/ceph/osd/ceph-1/fsid
objectstore_type is bluestore
Stopping old systemd unit ceph-osd@1...
Disabling old systemd unit ceph-osd@1...
Moving data...
Chowning content...
Chowning /var/lib/ceph/11111111-1111-1111-1111-111111111111/osd.1/block...
Disabling host unit ceph-volume@ lvm unit...
Non-zero exit code 1 from systemctl disable ceph-volume@lvm-1-f3cfe0e4-70f3-4078-9aba-2d45170e9df9.service
systemctl: stderr Failed to disable unit: Unit file ceph-volume@lvm-1-f3cfe0e4-70f3-4078-9aba-2d45170e9df9.service does not exist.
Moving logs...
Creating new units...

Or use a loop to adopt all OSDs on the current node at once:

for osd_id in $(docker ps --filter "name=ceph-osd" --format "{{.Names}}" | sed 's/ceph-osd-//'); do
    sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name osd.${osd_id}
    sudo systemctl reset-failed ceph-osd@${osd_id}.service 2>/dev/null || true
done

info

During OSD adoption you may see an error like:

Non-zero exit code 1 from systemctl disable ceph-volume@lvm-<OSD_ID>-<UUID>.service
Failed to disable unit: Unit file ceph-volume@lvm-<OSD_ID>-<UUID>.service does not exist.

This error is harmless. Cephadm attempts to disable the legacy ceph-volume systemd unit as part of the adoption process. When Ceph was deployed with ceph-ansible using containers, this unit does not exist, so the disable command fails. The OSD is still adopted correctly.
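
The wait for active+clean between nodes can be scripted with a generic polling helper. The sketch below is one possible shape: wait_until is a hypothetical helper that retries an arbitrary check command until it succeeds or a timeout expires. The check itself (how you confirm that all PGs are active+clean) is left to the operator and is not defined here.

```shell
# wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Re-runs the command until it succeeds; gives up after the timeout.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "${elapsed}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep "${interval}"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Hypothetical usage, where pgs_clean stands for your own check:
#   wait_until 1800 10 pgs_clean || echo "PGs did not settle in time" >&2

# Self-contained demonstration: the check succeeds on its third call.
counter_file=$(mktemp)
third_call_succeeds() {
    echo x >> "${counter_file}"
    [ "$(wc -l < "${counter_file}" | tr -d ' ')" -ge 3 ]
}
wait_until 5 1 third_call_succeeds && echo "condition met"
rm -f "${counter_file}"
```

A timeout keeps an unattended migration from looping forever if the cluster does not recover on its own.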

After all OSDs have been adopted, verify that the orchestrator recognises them:

ceph orch ls --refresh

The output should now also include the OSD service:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mgr 3/0 5m ago - <unmanaged>
mon 3/0 5m ago - <unmanaged>
osd 6 5m ago - <unmanaged>

Verify that all OSD daemon instances are running:

ceph orch ps --daemon-type osd --refresh

The output should look similar to this. All OSDs should show as running:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
osd.0 testbed-node-3 running (46s) 0s ago - 181M 4096M 18.2.8 01985efead8e f83bb9204db5
osd.1 testbed-node-4 running (5m) 0s ago 2m 167M 4096M 18.2.8 01985efead8e 11cd8d77a78b
osd.2 testbed-node-5 running (118s) 0s ago - 176M 4096M 18.2.8 01985efead8e f608633171a8
osd.3 testbed-node-4 running (3m) 0s ago 2m 196M 4096M 18.2.8 01985efead8e 5032744c6063
osd.4 testbed-node-5 running (2m) 0s ago - 179M 4096M 18.2.8 01985efead8e a9c9f18801d9
osd.5 testbed-node-3 running (38s) 0s ago - 153M 4096M 18.2.8 01985efead8e bfd02ac996db

Once all PGs are active+clean, remove the safety flags and re-enable the PG autoscaler on the OSISM manager node:

ceph osd unset noout
ceph osd unset nodeep-scrub
ceph balancer on
if [ -f /home/dragon/autoscale_pools.txt ]; then
    while read -r pool; do
        if ceph osd pool set "${pool}" pg_autoscale_mode on > /dev/null 2>&1; then
            echo "OK: ${pool}"
        else
            echo "FAILED: ${pool}"
        fi
    done < /home/dragon/autoscale_pools.txt
fi
# Once all pools show OK:, run: rm /home/dragon/autoscale_pools.txt

Migrating crash daemons

The crash daemons cannot be adopted from ceph-ansible and must be redeployed. Stop and disable the legacy crash services on each Ceph node:

CEPH_HOSTNAME=$(hostname)
sudo systemctl stop ceph-crash@${CEPH_HOSTNAME}.service
sudo systemctl disable ceph-crash@${CEPH_HOSTNAME}.service

Then deploy new crash daemons via cephadm on the OSISM manager node:

ceph orch apply crash

Verify that the crash daemons are running:

ceph orch ls

The output should now include the crash service:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
crash 6/6 3s ago 23s *
mgr 3/0 3s ago - <unmanaged>
mon 3/0 3s ago - <unmanaged>
osd 6 3s ago - <unmanaged>

Migrating RGW daemons

note

The migration of RGW has currently only been tested for single-site deployments. Instructions for multi-site RGW setups will be added in a future update of this guide.

RGW daemons cannot be adopted in-place with cephadm adopt. Instead, new RGW daemons are deployed via the orchestrator and the legacy daemons are stopped afterwards.

Determine the RGW nodes on the OSISM manager node:

osism get hosts -l ceph-rgw

Determine the RGW realm, zone group, and zone name from the running cluster:

radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list

In a typical single-site OSISM deployment, the zone and zonegroup are both named default. The service ID for the ceph orch apply rgw command is composed as <realm_name>.<zone_name>. The realm list may be empty; in that case the script below falls back to default, giving a service ID of default.default.
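
The composition rule, including the fallback for an empty realm, can be sketched as a small function. The name rgw_service_id is illustrative and not part of any Ceph or OSISM tooling.

```shell
# Compose the RGW service ID as <realm>.<zone>.
# An empty or missing value falls back to "default".
rgw_service_id() {
    realm=${1:-default}
    zone=${2:-default}
    printf '%s.%s\n' "${realm}" "${zone}"
}

# A single-site deployment with no realm configured:
rgw_service_id "" "default"
```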

Determine the RGW frontend port from the ceph-ansible configuration (environments/ceph/configuration.yml). The variable radosgw_frontend_port contains the port (default: 8081).

If the RGW service was configured with SSL (i.e. radosgw_frontend_ssl_certificate is set in the ceph-ansible configuration), the SSL certificate must be imported into the Ceph config-key store before deploying. Run this on each RGW node:

sudo ceph config-key set rgw/cert/rgw.$(hostname) -i <path_to_ssl_certificate>

warning

Unlike MDS daemons, RGW daemons bind to a specific port. The legacy daemon on a node must be stopped before the new one can start, otherwise it will fail due to a port conflict. The migration is therefore performed one node at a time to minimize S3/Swift API downtime — the remaining RGW nodes continue serving requests while one node is being migrated.

Prepare the variables and deploy the RGW service on the OSISM manager node. The orchestrator will attempt to start daemons on each node, but they will only come up once the legacy daemon on that node has been stopped. If SSL is used, add --ssl to the command:

RGW_REALM=$(radosgw-admin realm list --format json | python3 -c "import json,sys; print(json.load(sys.stdin)['realms'][0])" 2>/dev/null || echo "default")
RGW_ZONE=$(radosgw-admin zone list --format json | python3 -c "import json,sys; print(json.load(sys.stdin)['zones'][0])" 2>/dev/null || echo "default")
RGW_SERVICE_ID="${RGW_REALM}.${RGW_ZONE}"
RGW_PLACEMENT=$(osism get hosts -l ceph-rgw | awk 'NR>3 && /\|/ {print $2}' | paste -sd,)
RGW_PORT=$(python3 -c "import yaml; print(yaml.safe_load(open('/opt/configuration/environments/ceph/configuration.yml')).get('radosgw_frontend_port', 8081))")

ceph orch apply rgw ${RGW_SERVICE_ID} --placement="${RGW_PLACEMENT}" --port=${RGW_PORT}
# If SSL is enabled, add --ssl:
# ceph orch apply rgw ${RGW_SERVICE_ID} --placement="${RGW_PLACEMENT}" --port=${RGW_PORT} --ssl

Then migrate each RGW node sequentially. On the OSISM manager node:

for node in $(osism get hosts -l ceph-rgw | awk 'NR>3 && /\|/ {print $2}'); do
echo "Stopping legacy RGW daemon on ${node}..."
ssh ${node} "sudo systemctl stop ceph-radosgw.target; sudo systemctl disable ceph-radosgw.target"

echo "Waiting for new RGW daemon on ${node}..."
until ceph orch ps --daemon-type rgw --format json | python3 -c "
import json, sys
daemons = json.load(sys.stdin)
sys.exit(0 if any(d['hostname'] == '${node}' and d.get('status_desc') == 'running' for d in daemons) else 1)
" 2>/dev/null; do
sleep 5
done
echo "New RGW daemon on ${node} is running."
done

The output should look similar to this:

Stopping legacy RGW daemon on testbed-node-3...
Removed "/etc/systemd/system/multi-user.target.wants/ceph-radosgw.target".
Removed "/etc/systemd/system/ceph.target.wants/ceph-radosgw.target".
Waiting for new RGW daemon on testbed-node-3...
New RGW daemon on testbed-node-3 is running.
Stopping legacy RGW daemon on testbed-node-4...
Removed "/etc/systemd/system/multi-user.target.wants/ceph-radosgw.target".
Removed "/etc/systemd/system/ceph.target.wants/ceph-radosgw.target".
Waiting for new RGW daemon on testbed-node-4...
New RGW daemon on testbed-node-4 is running.
Stopping legacy RGW daemon on testbed-node-5...
Removed "/etc/systemd/system/multi-user.target.wants/ceph-radosgw.target".
Removed "/etc/systemd/system/ceph.target.wants/ceph-radosgw.target".
Waiting for new RGW daemon on testbed-node-5...
New RGW daemon on testbed-node-5 is running.

Verify that all new RGW daemons are running:

ceph orch ps --daemon-type rgw

The output should look similar to this:

NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
rgw.default.default.testbed-node-3.dhmops testbed-node-3 *:8081 running (39s) 32s ago 39s 90.7M - 18.2.8 fb822dbe2ee6 d4b9a7c26ecd
rgw.default.default.testbed-node-4.ibjvjy testbed-node-4 *:8081 running (41s) 32s ago 41s 90.5M - 18.2.8 fb822dbe2ee6 fcf4d6c837f4
rgw.default.default.testbed-node-5.xcegdm testbed-node-5 *:8081 running (43s) 32s ago 43s 91.5M - 18.2.8 fb822dbe2ee6 569fd2afad81

Migrating MDS daemons

MDS daemons cannot be adopted in-place with cephadm adopt. Instead, new MDS daemons are deployed via the orchestrator and the legacy daemons are stopped afterwards.

Determine the CephFS filesystem name on the OSISM manager node:

ceph fs ls

This returns output like name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]. The name (e.g. cephfs) is needed for the next step.

The CephFS name also corresponds to the cephfs variable in the ceph-ansible configuration (environments/ceph/configuration.yml).

Determine the MDS nodes on the OSISM manager node:

osism get hosts -l ceph-mds

Deploy the new MDS daemons and stop the legacy ones by running the following script on the OSISM manager node:

info

The new MDS daemons deployed by the orchestrator will come up as standby instances. CephFS supports multiple MDS daemons simultaneously (controlled by max_mds), and new daemons automatically join as standby. The legacy active daemons continue serving requests until they are explicitly stopped, at which point CephFS promotes the standby daemons to active. This brief coexistence of old and new daemons is expected and safe.

CEPHFS_NAME=$(ceph fs ls --format json | python3 -c "import json,sys; print(json.load(sys.stdin)[0]['name'])")
MDS_PLACEMENT=$(osism get hosts -l ceph-mds | awk 'NR>3 && /\|/ {print $2}' | paste -sd,)

# Deploy new MDS daemons via the orchestrator
ceph orch apply mds ${CEPHFS_NAME} --placement="${MDS_PLACEMENT}"

# Wait until the new MDS daemons are running
echo "Waiting for new MDS daemons..."
until ceph orch ps --daemon-type mds --format json | python3 -c "
import json, sys
daemons = json.load(sys.stdin)
running = [d for d in daemons if d.get('status_desc') == 'running']
sys.exit(0 if len(running) >= len('${MDS_PLACEMENT}'.split(',')) else 1)
" 2>/dev/null; do
sleep 5
done
echo "New MDS daemons are running."

Verify that the new MDS daemons are running:

ceph orch ls --service-type mds

The output should look similar to this:

NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mds.cephfs 3/3 0s ago 13s testbed-node-3;testbed-node-4;testbed-node-5

Then stop and disable the legacy MDS daemons. Run the following on the OSISM manager node:

for node in $(osism get hosts -l ceph-mds | awk 'NR>3 && /\|/ {print $2}'); do
    echo "Stopping legacy MDS daemon on ${node}..."
    ssh ${node} "sudo systemctl stop ceph-mds@${node}.service; sudo systemctl disable ceph-mds@${node}.service"
done

Step 6: Verify the migration

Verify the full cluster state on the OSISM manager node:

ceph -s
ceph orch ps

All daemons should appear in the ceph orch ps output with status running. Monitor and manager services should show with a placement (managed). OSD daemons will still show as <unmanaged> — this is expected (see the note in Step 6 on how to optionally transition them to a managed state). Cluster health should be HEALTH_OK.

Verify that all daemons are running the same Ceph version (on the OSISM manager node):

ceph versions

The output should show a single version across all daemon types. If multiple versions appear for any daemon type, some daemons were not adopted with the correct container image.

{
"mon": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"mgr": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"osd": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 6
},
"mds": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"rgw": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"overall": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 18
}
}
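
The single-version condition can be checked by counting the distinct version strings in the output. This sketch treats the JSON as text and assumes every version key starts with "ceph version", as in the sample above. The helper name distinct_ceph_versions is illustrative.

```shell
# Count the distinct "ceph version ..." strings found on stdin.
distinct_ceph_versions() {
    grep -o '"ceph version [^"]*"' | sort -u | wc -l | tr -d ' '
}

# Shortened sample in the same shape as the output above.
sample='{
"mon": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"osd": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 6
},
"overall": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 9
}
}'

count=$(printf '%s\n' "${sample}" | distinct_ceph_versions)
if [ "${count}" -eq 1 ]; then
    echo "single version"
else
    echo "mixed versions: ${count}"
fi
```

On the OSISM manager node the input would be the output of ceph versions.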

Step 7: Clean up ceph-ansible artifacts

Only proceed with this step after completing the verification in Step 6: ceph -s shows HEALTH_OK, all daemons appear as running in ceph orch ps, and ceph versions reports a single consistent version across all daemon types.

  1. Remove old systemd unit files that are no longer used. Run on each Ceph node. Do not use a wildcard like ceph-*.service as this would also remove the cephadm-managed units (e.g. ceph-<FSID>@.service). Remove only the legacy units:

    sudo rm -f /etc/systemd/system/ceph-mon@.service \
    /etc/systemd/system/ceph-mgr@.service \
    /etc/systemd/system/ceph-osd@.service \
    /etc/systemd/system/ceph-mds@.service \
    /etc/systemd/system/ceph-radosgw@.service \
    /etc/systemd/system/ceph-crash@.service \
    /etc/systemd/system/ceph-mon.target \
    /etc/systemd/system/ceph-mgr.target \
    /etc/systemd/system/ceph-osd.target \
    /etc/systemd/system/ceph-mds.target \
    /etc/systemd/system/ceph-radosgw.target \
    /etc/systemd/system/ceph.target
    sudo rm -rf /etc/systemd/system/ceph.target.wants
    sudo systemctl daemon-reload
  2. The ceph-ansible configuration in the OSISM configuration repository (environments/ceph/) can be kept as a reference but is no longer active.
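Before deleting anything, it can help to see which of the legacy unit files are actually present on a node. The following is a minimal sketch of such a dry run; the function name list_legacy_units is hypothetical, and the file names are copied from the rm command in item 1. It only lists files and removes nothing.

```shell
# List the legacy unit files that exist in a directory
# (default: /etc/systemd/system). Nothing is deleted.
list_legacy_units() {
    unit_dir="${1:-/etc/systemd/system}"
    for name in \
        ceph-mon@.service ceph-mgr@.service ceph-osd@.service \
        ceph-mds@.service ceph-radosgw@.service ceph-crash@.service \
        ceph-mon.target ceph-mgr.target ceph-osd.target \
        ceph-mds.target ceph-radosgw.target ceph.target; do
        if [ -e "${unit_dir}/${name}" ]; then
            echo "${unit_dir}/${name}"
        fi
    done
}

# Usage on a Ceph node:
#   list_legacy_units
```

Because the names are matched exactly, cephadm-managed units such as ceph-<FSID>@.service never appear in the output.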

note

Do not remove /etc/ceph/ceph.conf or any keyring files. These are still used by the cluster and are now managed by cephadm.

Post-migration operations

After migration, Ceph lifecycle operations are performed through cephadm and the Ceph orchestrator instead of osism apply ceph-* commands.

ceph-ansible (before) | cephadm (after)
osism apply ceph-mons | ceph orch apply mon
osism apply ceph-mgrs | ceph orch apply mgr
osism apply ceph-osds | ceph orch apply osd
osism apply ceph-rgws | ceph orch apply rgw
osism apply ceph-mdss | ceph orch apply mds
Editing configuration.yml | ceph config set <section> <key> <val>
osism apply ceph-rolling_update | ceph orch upgrade start --image <img>

For the full cephadm operations reference, see the cephadm documentation.

Troubleshooting

Adoption fails with "daemon not found"

Ensure that the daemon is still running under the legacy systemd unit before attempting adoption. Check with:

sudo systemctl status ceph-<type>@<id>

OSD adoption fails with missing fsid or type file

In some containerized deployments, the files /var/lib/ceph/osd/ceph-<id>/fsid and /var/lib/ceph/osd/ceph-<id>/type may not exist on the host. Cephadm requires these files during adoption. Create them manually before retrying:

OSD_ID=<osd_id>
OSD_FSID=$(sudo ceph-volume lvm list ${OSD_ID} --format json | python3 -c "import json,sys; d=json.load(sys.stdin); print([e['tags']['ceph.osd_fsid'] for e in d['${OSD_ID}'] if e['type']=='block'][0])")
sudo mkdir -p /var/lib/ceph/osd/ceph-${OSD_ID}
echo "${OSD_FSID}" | sudo tee /var/lib/ceph/osd/ceph-${OSD_ID}/fsid
echo "bluestore" | sudo tee /var/lib/ceph/osd/ceph-${OSD_ID}/type
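After creating the files, a quick check confirms that both are present and non-empty before retrying the adoption. This is a minimal sketch; the function name check_osd_files and its optional base directory argument are hypothetical.

```shell
# Return 0 if both fsid and type exist and are non-empty for the given OSD.
# Arguments: <osd_id> [base_dir], where base_dir defaults to /var/lib/ceph/osd.
check_osd_files() {
    osd_id="$1"
    base_dir="${2:-/var/lib/ceph/osd}"
    for f in fsid type; do
        if [ ! -s "${base_dir}/ceph-${osd_id}/${f}" ]; then
            echo "missing or empty: ${base_dir}/ceph-${osd_id}/${f}" >&2
            return 1
        fi
    done
}

# Usage on the OSD node (run as root if the directory is not readable otherwise):
#   check_osd_files "${OSD_ID}" && echo "OSD ${OSD_ID} is ready for adoption"
```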

Container image pull fails

Verify that the node can reach the container registry. For OSISM environments:

sudo cephadm pull registry.osism.tech/osism/ceph-daemon:${CEPH_RELEASE}

If the pull fails, you may see an error like:

Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
ERROR: failed to pull image: unable to pull image: Error initializing source docker://registry.osism.tech/osism/ceph-daemon:reef: error pinging docker registry registry.osism.tech: Get "https://registry.osism.tech/v2/": dial tcp 185.56.112.10:443: i/o timeout

This typically indicates a network connectivity issue. If the node is behind a proxy, ensure the container runtime is configured to use it (e.g. via HTTP_PROXY/HTTPS_PROXY in /etc/systemd/system/docker.service.d/proxy.conf or the equivalent for Podman).
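For Docker, the proxy drop-in mentioned above is a short systemd unit fragment. The sketch below prints such a fragment so it can be reviewed before it is written to disk. The function name render_proxy_dropin, the proxy URL in the usage comment, and the default NO_PROXY value are illustrative assumptions, not values from an OSISM environment.

```shell
# Print a systemd drop-in that sets proxy variables for the Docker service.
# Arguments: <proxy_url> [no_proxy_list]
render_proxy_dropin() {
    proxy_url="$1"
    no_proxy="${2:-localhost,127.0.0.1}"
    cat <<EOF
[Service]
Environment="HTTP_PROXY=${proxy_url}"
Environment="HTTPS_PROXY=${proxy_url}"
Environment="NO_PROXY=${no_proxy}"
EOF
}

# Usage (hypothetical proxy address):
#   sudo mkdir -p /etc/systemd/system/docker.service.d
#   render_proxy_dropin http://proxy.example.com:3128 \
#       | sudo tee /etc/systemd/system/docker.service.d/proxy.conf
```

Docker picks up the drop-in only after sudo systemctl daemon-reload and a restart of the Docker service.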

Cluster health degrades during migration

Stop the migration and investigate. Common causes:

  • A daemon was adopted on the wrong node.
  • The container image version does not match the running Ceph version.
  • Network connectivity issues between nodes.

Resolve the issue before continuing. An adopted daemon can be reverted by stopping the cephadm-managed unit and restarting the legacy systemd unit if it was preserved:

# Example: reverting an adopted OSD with ID 3
# Stop the cephadm-managed unit
sudo systemctl stop ceph-<FSID>@osd.3.service
sudo systemctl disable ceph-<FSID>@osd.3.service

# Restart the legacy systemd unit
sudo systemctl enable ceph-osd@3.service
sudo systemctl start ceph-osd@3.service

Replace <FSID> with the cluster's FSID, which can be found with ceph fsid. The same pattern applies to other daemon types (replace osd.3 with e.g. mon.<hostname> or mgr.<hostname>).
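The two unit names in the example follow a fixed pattern, which the following sketch derives from a daemon name such as osd.3. The function names cephadm_unit and legacy_unit are hypothetical. The legacy mapping assumes the legacy unit is named ceph-<type>@<id>.service, which matches the OSD example above but does not cover RGW, whose legacy units use the ceph-radosgw@ prefix.

```shell
# cephadm_unit <fsid> <type.id>  ->  ceph-<fsid>@<type.id>.service
cephadm_unit() {
    echo "ceph-${1}@${2}.service"
}

# legacy_unit <type.id>  ->  ceph-<type>@<id>.service
# Assumption: valid for osd, mon, mgr and mds; not for rgw.
legacy_unit() {
    daemon_type="${1%%.*}"   # text before the first dot
    daemon_id="${1#*.}"      # text after the first dot
    echo "ceph-${daemon_type}@${daemon_id}.service"
}

# Usage:
#   sudo systemctl stop "$(cephadm_unit "$(ceph fsid)" osd.3)"
#   sudo systemctl start "$(legacy_unit osd.3)"
```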

SSH connection issues

Cephadm requires SSH access to all nodes. Verify on the OSISM manager node:

ceph cephadm check-host <node>

Or use a loop to check all Ceph nodes at once:

for node in $(osism get hosts -l ceph | awk 'NR>3 && /\|/ {print $2}'); do
    echo "=== ${node} ==="
    ceph cephadm check-host "${node}"
done

Ensure the cephadm SSH key has been distributed to all nodes (see Step 4).