Migrate from Ceph-Ansible to Cephadm
The migration from ceph-ansible to cephadm is performed in place by adopting existing Ceph daemons one at a time. While this process is designed to be non-disruptive, it is strongly advised to test the migration in a controlled environment first, such as the OSISM testbed. Ensure that precautionary backups are made and all other necessary safety measures are in place before migrating a production cluster.
This guide is a work in progress. The following areas are not yet covered or tested:
- Multi-site RGW: Only single-site RGW deployments have been tested. Multi-site migration instructions will be added in a future update.
- Backup and safety measures: Specific guidance on recommended backup strategies and concrete pre-migration safety measures is still being prepared.
- Automated readiness checks: A planned osism apply ready-for-cephadm task to automate prerequisite verification is not yet available.
This note will be updated as additional sections are completed.
Background
The deployment tool ceph-ansible is deprecated as of OSISM 10 and will not be supported in upcoming OSISM releases. The official recommendation from the Ceph project is to migrate to cephadm.
After migration, Ceph daemons run as containers managed by cephadm instead of ceph-ansible. All lifecycle operations (upgrades, expansions, configuration changes) are then performed through cephadm and the Ceph orchestrator.
For the full upstream documentation, refer to Switching from ceph-ansible to cephadm.
Prerequisites
- A running Ceph cluster deployed with ceph-ansible via OSISM.
- All Ceph daemons are healthy (ceph -s reports HEALTH_OK or only expected warnings).
- SSH access from the OSISM manager node to all Ceph nodes (required by cephadm for orchestration).
- The Ceph cluster must be running Ceph Octopus (15.2.x) or later. Clusters on OSISM 7 or later already meet this requirement.
- Python 3 and lvm2 must be installed on all Ceph nodes (these are typically already present).
TODO: Consider replacing the manual prerequisite checks with an osism apply ready-for-cephadm
task that automatically verifies all conditions (cluster health, SSH access, Ceph version,
required packages).
Step 1: Verify cluster health
Before starting the migration, ensure the cluster is in a healthy state. Run the following commands on the OSISM manager node.
ceph -s
ceph osd tree
ceph df
All PGs should be active+clean. Resolve any degraded or misplaced PGs before proceeding.
In ceph osd tree, verify that all OSDs show up with a non-zero REWEIGHT. In ceph df,
check that %RAW USED is well below 85% (Ceph's default nearfull threshold).
Step 2: Install cephadm
Install cephadm on all Ceph nodes. The version of cephadm must match the running Ceph release.
First, determine the running Ceph release on the OSISM manager node:
ceph version
This returns output like ceph version 18.2.7 (...) reef (stable). The release series
(e.g. reef, major version 18) determines which package to install in the next step.
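As a minimal sketch, the version number and release name can be pulled out of that line with awk. The helper name parse_ceph_version is hypothetical, and the field positions assume the output format shown above.

```shell
# Extract the version number and release name from a `ceph version` line.
# Assumes the format "ceph version <x.y.z> (<hash>) <release> (stable)".
parse_ceph_version() {
    # Field 3 is the version number, field 5 the release name.
    awk '$1 == "ceph" && $2 == "version" { print $3, $5 }'
}

# Usage on the OSISM manager node:
#   ceph version | parse_ceph_version
```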
Control nodes (mon, mgr)
OSD nodes require a different installation method due to a known bug in the UCA cephadm package for Reef. See the OSD nodes section below before installing on OSD nodes.
Install the cephadm package from the Ubuntu Cloud Archive on the control nodes:
| Ceph Release | cephadm Package |
|---|---|
| Quincy (17) | cephadm_17.2.9-0ubuntu0.22.04.1~cloud0_amd64.deb |
| Reef (18) | cephadm_18.2.4-0ubuntu1~cloud1_amd64.deb |
| Squid (19) | cephadm_19.2.3-0ubuntu0.24.04.2~cloud0_amd64.deb |
These are the only versions available in the UCA. The exact point release does not need to match your running Ceph version -- only the release series (Quincy, Reef, Squid) matters.
The package is installed directly via dpkg rather than apt install because the
Ubuntu Cloud Archive is typically not configured as an apt source in OSISM environments.
Adding the full UCA repository just for cephadm is not recommended, as it could
introduce unintended package upgrades from the UCA that conflict with the versions
managed by OSISM.
CEPHADM_PKG=cephadm_18.2.4-0ubuntu1~cloud1_amd64.deb
curl --silent --remote-name --location https://ubuntu-cloud.archive.canonical.com/ubuntu/pool/main/c/ceph/${CEPHADM_PKG}
sudo dpkg -i ${CEPHADM_PKG}
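To avoid copying the wrong file name, the table above can be expressed as a small lookup. This is only a sketch: the function name cephadm_pkg_for is hypothetical, and the package names are taken verbatim from the table. It applies to control nodes; Reef OSD nodes use a different installation method.

```shell
# Map a Ceph release name to the UCA cephadm package listed in the table.
cephadm_pkg_for() {
    case "$1" in
        quincy) echo "cephadm_17.2.9-0ubuntu0.22.04.1~cloud0_amd64.deb" ;;
        reef)   echo "cephadm_18.2.4-0ubuntu1~cloud1_amd64.deb" ;;
        squid)  echo "cephadm_19.2.3-0ubuntu0.24.04.2~cloud0_amd64.deb" ;;
        *)      echo "no UCA cephadm package known for release '$1'" >&2
                return 1 ;;
    esac
}

# Usage:
#   CEPHADM_PKG=$(cephadm_pkg_for reef) || exit 1
```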
OSD nodes
If you are running Quincy or Squid, install cephadm on OSD nodes using the same UCA package as described for control nodes above.
If you are running Reef (18.x), the UCA cephadm package (18.2.4) contains a known bug that causes a crash when parsing AppArmor profiles during OSD adoption. This was fixed upstream and is included in Reef v18.2.5+, but the UCA package has not been updated beyond 18.2.4.
On Reef OSD nodes, install cephadm as a standalone Python script from the Ceph Git repository instead to get a version that includes the fix:
CEPH_RELEASE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}' | cut -d: -f2)
curl --silent --remote-name --location https://raw.githubusercontent.com/ceph/ceph/${CEPH_RELEASE}/src/cephadm/cephadm.py
chmod +x cephadm.py
sudo mv cephadm.py /usr/sbin/cephadm
Step 3: Prepare the cephadm configuration
On each Ceph node, prepare the host for cephadm. The cephadm prepare-host command
performs a series of checks to ensure the host meets the requirements for managing
Ceph daemons with cephadm. Specifically, it verifies:
- A container runtime (Podman or Docker) is installed and functional
- LVM2 (lvm2 package) is available, which Ceph OSDs require for managing logical volumes
- Time synchronization (e.g. chrony or NTP) is enabled and running; clock skew between Ceph nodes can cause monitors to lose quorum
- General system prerequisites such as the availability of systemctl
If any of these checks fail, prepare-host will report the issue so it can be resolved
before proceeding with the migration.
sudo cephadm prepare-host
The output should look similar to this:
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chrony.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit chrony.service is enabled and running
Host looks OK
If the output indicates errors or missing dependencies, resolve them before proceeding.
Determine the currently used container image and set it in the Ceph configuration. Run the following commands on the OSISM manager node:
MON_NODE=$(osism get hosts -l ceph-mon | awk 'NR>3 && /\|/ {print $2}' | head -1)
CEPH_IMAGE=$(ssh ${MON_NODE} "docker inspect \$(docker ps --filter 'name=ceph' --format '{{.Names}}' | head -1) --format '{{.Config.Image}}'")
echo "Container image: ${CEPH_IMAGE}"
ceph config set global container_image ${CEPH_IMAGE}
Do not run the docker inspect command on the manager node: it runs the cephclient
container there and would return the wrong image (cephclient:reef instead of
ceph-daemon:reef).
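A simple guard can catch the wrong image before it is written to the configuration. The sketch below is an assumption-laden example: the function name is_ceph_daemon_image is hypothetical, and it only recognises the ceph-daemon and cephclient repository names mentioned above.

```shell
# Return success only if the image reference looks like the Ceph daemon image.
is_ceph_daemon_image() {
    case "$1" in
        */cephclient:*|cephclient:*)   return 1 ;;  # client image from the manager node
        */ceph-daemon:*|ceph-daemon:*) return 0 ;;
        *)                             return 1 ;;  # anything else: verify manually
    esac
}

# Usage:
#   is_ceph_daemon_image "${CEPH_IMAGE}" && ceph config set global container_image "${CEPH_IMAGE}"
```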
Import the existing ceph.conf into the central monitor config store. This ensures
that custom tuning parameters are preserved after migration, as cephadm uses the
centralized config store instead of per-node ceph.conf files. Since the ceph CLI
is not installed on the host, execute the command inside the crash container on
one of the monitor nodes:
docker exec ceph-crash-$(hostname) ceph config assimilate-conf -i /etc/ceph/ceph.conf
In a typical OSISM deployment, the ceph.conf is identical across all nodes, so running
this once is sufficient. If nodes have individual tuning parameters in their ceph.conf,
run the command on each affected node.
Step 4: Configure cephadm
Enable the cephadm orchestrator module on the OSISM manager node:
ceph mgr module enable cephadm
ceph orch set backend cephadm
Configure cephadm to use the dragon user (which has passwordless sudo) instead of
root. In a standard OSISM deployment, the operator SSH key already exists at
/opt/ansible/secrets/id_rsa.operator. Import it so that cephadm can connect to all
Ceph nodes. On the OSISM manager node:
ceph cephadm set-user dragon
cp /opt/ansible/secrets/id_rsa.operator* /opt/cephclient/data/
ceph cephadm set-priv-key -i /data/id_rsa.operator
ceph cephadm set-pub-key -i /data/id_rsa.operator.pub
rm /opt/cephclient/data/id_rsa.operator*
If no existing SSH key is available, generate a new one and distribute it to all Ceph nodes instead:
ceph cephadm generate-key
ceph cephadm get-pub-key > /tmp/cephadm-pub-key.pub
Copy the public key to all Ceph nodes:
ssh-copy-id -f -i /tmp/cephadm-pub-key.pub dragon@<node>
Or use a loop to distribute the key to all Ceph nodes at once:
for node in $(osism get hosts -l ceph | awk 'NR>3 && /\|/ {print $2}'); do
ssh-copy-id -f -i /tmp/cephadm-pub-key.pub dragon@${node}
done
Register all Ceph nodes with the orchestrator on the OSISM manager node:
ceph orch host add <node> <node-ip>
Or use a loop to register all Ceph nodes at once:
for node in $(osism get hosts -l ceph | awk 'NR>3 && /\|/ {print $2}'); do
ceph orch host add ${node} $(getent hosts ${node} | awk '{print $1}')
done
The output should look similar to this:
Added host 'testbed-node-0' with addr '192.168.16.10'
Added host 'testbed-node-1' with addr '192.168.16.11'
Added host 'testbed-node-2' with addr '192.168.16.12'
Added host 'testbed-node-3' with addr '192.168.16.13'
Added host 'testbed-node-4' with addr '192.168.16.14'
Added host 'testbed-node-5' with addr '192.168.16.15'
Verify that all hosts have been registered:
ceph orch host ls
The output should look similar to this:
HOST ADDR LABELS STATUS
testbed-node-0 192.168.16.10
testbed-node-1 192.168.16.11
testbed-node-2 192.168.16.12
testbed-node-3 192.168.16.13
testbed-node-4 192.168.16.14
testbed-node-5 192.168.16.15
6 hosts in cluster
Step 5: Adopt daemons
Adopt all Ceph daemons across the cluster. The recommended order is:
- Monitors (mon)
- Managers (mgr)
- OSDs (osd)
For each daemon, run the adopt command on the respective node.
Verify cluster health with ceph -s on the OSISM manager node after each step.
Do not proceed if ceph -s reports HEALTH_ERR or degraded/unavailable PGs.
During migration, HEALTH_WARN with messages like "stray daemon(s) not managed by
cephadm" or "failed to probe daemons or devices" is expected and safe to continue.
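The stop condition can be sketched as a small gate that looks only at the first word of the ceph health output. The function name health_gate is hypothetical. It does not inspect PG states, and HEALTH_WARN still needs a manual look because only the warnings named above are expected.

```shell
# Decide from the first word of `ceph health` whether to continue.
# HEALTH_OK and HEALTH_WARN continue; HEALTH_ERR or anything unexpected stops.
health_gate() {
    state=${1%% *}   # keep only the first word, e.g. HEALTH_WARN
    case "$state" in
        HEALTH_OK|HEALTH_WARN) return 0 ;;
        *) echo "stop: cluster reports '${state}'" >&2
           return 1 ;;
    esac
}

# Usage on the OSISM manager node:
#   health_gate "$(ceph health)" || exit 1
```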
Adopting monitor daemons
The monitor daemons (MON) maintain the cluster map and are responsible for consensus among the Ceph nodes. The adopt command converts each monitor from its legacy systemd/Docker-based deployment (as set up by ceph-ansible) to a cephadm-managed container. During this process cephadm will:
- Pull the container image (if not already present)
- Stop and disable the old systemd unit (e.g. ceph-mon@<hostname>)
- Move the monitor's data directory and logs into the cephadm-managed directory layout under /var/lib/ceph/<fsid>/
- Create new systemd units managed by cephadm
The monitor service remains available throughout: the other monitors maintain quorum while one is being adopted.
On each monitor node, set the required variables and adopt the daemon:
CEPH_HOSTNAME=$(hostname)
CEPH_IMAGE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}')
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name mon.${CEPH_HOSTNAME}
sudo systemctl reset-failed ceph-mon@${CEPH_HOSTNAME}.service 2>/dev/null || true
The output should look similar to this:
Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
Stopping old systemd unit ceph-mon@testbed-node-0...
Disabling old systemd unit ceph-mon@testbed-node-0...
Moving data...
Chowning content...
Moving logs...
Creating new units...
Adopting manager daemons
The manager daemons (MGR) provide additional monitoring and management interfaces for the cluster (e.g. the dashboard, Prometheus metrics, and the orchestrator module). The adopt process is identical to that of the monitors — cephadm stops the legacy unit, migrates data and logs, and creates new cephadm-managed systemd units. Since multiple managers run in active/standby mode, adopting one at a time ensures the cluster always has an active manager available.
On each manager node, set the required variables and adopt the daemon:
CEPH_HOSTNAME=$(hostname)
CEPH_IMAGE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}')
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name mgr.${CEPH_HOSTNAME}
sudo systemctl reset-failed ceph-mgr@${CEPH_HOSTNAME}.service 2>/dev/null || true
The output should look similar to this:
Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
Stopping old systemd unit ceph-mgr@testbed-node-0...
Disabling old systemd unit ceph-mgr@testbed-node-0...
Moving data...
Chowning content...
Moving logs...
Creating new units...
Verify monitors and managers
After all monitors and managers have been adopted, verify on the OSISM manager node that the orchestrator recognises them.
List all services known to the orchestrator:
ceph orch ls --refresh
The output should look similar to this. Both services show as <unmanaged> at this point,
which is expected:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mgr 3/0 43s ago - <unmanaged>
mon 3/0 43s ago - <unmanaged>
List all monitor daemon instances and their status:
ceph orch ps --daemon-type mon
The output should look similar to this. All monitors should show as running:
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mon.testbed-node-0 testbed-node-0 running (22m) 2m ago 13m 73.6M 2048M 18.2.8 01985efead8e e9f0ac0ce245
mon.testbed-node-1 testbed-node-1 running (19m) 2m ago 13m 69.5M 2048M 18.2.8 01985efead8e aa12850676f7
mon.testbed-node-2 testbed-node-2 running (19m) 2m ago 13m 64.5M 2048M 18.2.8 01985efead8e 43a13bac74fb
List all manager daemon instances and their status:
ceph orch ps --daemon-type mgr
The output should look similar to this. All managers should show as running:
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
mgr.testbed-node-0 testbed-node-0 running (3m) 2m ago 2m 459M - 18.2.8 01985efead8e b6c5b884b38c
mgr.testbed-node-1 testbed-node-1 running (2m) 2m ago - 458M - 18.2.8 01985efead8e 8adf9e898e82
mgr.testbed-node-2 testbed-node-2 running (2m) 2m ago - 504M - 18.2.8 01985efead8e 54e779780c5a
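Instead of reading the tables by eye, the JSON output of ceph orch ps can be filtered for daemons that are not running. This is a sketch: the function name list_not_running is hypothetical, hostname and status_desc are the fields used elsewhere in this guide, and daemon_type and daemon_id are assumed to be present as well. It relies on python3, which is listed in the prerequisites.

```shell
# Print every daemon whose status is not "running"; fail if any exist.
# Reads `ceph orch ps --format json` on stdin.
list_not_running() {
    python3 -c '
import json, sys

daemons = json.load(sys.stdin)
bad = [d for d in daemons if d.get("status_desc") != "running"]
for d in bad:
    name = "{}.{}".format(d.get("daemon_type", "?"), d.get("daemon_id", "?"))
    print("{} on {}: {}".format(name, d.get("hostname", "?"), d.get("status_desc")))
sys.exit(1 if bad else 0)
'
}

# Usage on the OSISM manager node:
#   ceph orch ps --format json | list_not_running
```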
Adopting OSD daemons
The OSD daemons (Object Storage Daemon) are responsible for storing the actual data on disk. Adopting OSDs is the most sensitive part of the migration because each OSD manages real data volumes (BlueStore). The adopt process migrates each OSD's data directory, block device symlinks, and logs into the cephadm layout — but the underlying data on disk is not moved or modified.
Because an OSD restart temporarily reduces the number of available replicas, safety flags
(noout, nodeep-scrub, balancer off) are set beforehand to prevent Ceph from
initiating unnecessary data rebalancing or deep scrubs while OSDs are being restarted
during adoption. The PG autoscaler is also disabled to avoid placement group changes
during the process.
Set the safety flags on the OSISM manager node before adopting any OSD:
ceph osd set noout
ceph osd set nodeep-scrub
ceph balancer off
Disable the PG autoscaler on all pools that have it enabled. Record which pools had it enabled so it can be re-enabled after adoption:
for pool in $(ceph osd pool ls); do
mode=$(ceph osd pool get ${pool} pg_autoscale_mode -f json | python3 -c "import json,sys; print(json.load(sys.stdin)['pg_autoscale_mode'])")
if [ "${mode}" = "on" ]; then
echo "${pool}" >> /home/dragon/autoscale_pools.txt
ceph osd pool set ${pool} pg_autoscale_mode off
fi
done
On each OSD node, set the required variable and identify the OSDs running on the node:
CEPH_IMAGE=$(docker inspect $(docker ps --filter "name=ceph" --format "{{.Names}}" | head -1) --format '{{.Config.Image}}')
docker ps --filter "name=ceph-osd"
The output should look similar to this:
CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES
05eb59c1ef36 registry.osism.tech/osism/ceph-daemon:reef "/usr/bin/ceph-osd …" 5 days ago Up 5 days ceph-osd-3
8646edf83163 registry.osism.tech/osism/ceph-daemon:reef "/usr/bin/ceph-osd …" 5 days ago Up 5 days ceph-osd-1
The OSD ID is the number after ceph-osd- in the container name. For example,
ceph-osd-1 has OSD ID 1 and ceph-osd-3 has OSD ID 3.
Adopt OSDs one node at a time. After completing all OSDs on a node, wait until all
PGs return to active+clean (ceph -s shows all PGs active+clean under data:)
before proceeding to the next node. The overall health may still show HEALTH_WARN for
stray daemons during migration — this is expected and not a stop condition.
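Waiting for active+clean can be scripted rather than checked by hand. The sketch below assumes that ceph status --format json exposes pgmap.num_pgs and pgmap.pgs_by_state; those field names and the function name all_pgs_clean are assumptions. It is deliberately strict: a PG in a state such as active+clean+scrubbing is not counted as clean.

```shell
# Succeed only if every PG is exactly active+clean.
# Reads `ceph status --format json` on stdin.
all_pgs_clean() {
    python3 -c '
import json, sys

pgmap = json.load(sys.stdin).get("pgmap", {})
total = pgmap.get("num_pgs", 0)
clean = sum(s.get("count", 0) for s in pgmap.get("pgs_by_state", [])
            if s.get("state_name") == "active+clean")
print("{}/{} PGs active+clean".format(clean, total))
sys.exit(0 if total > 0 and clean == total else 1)
'
}

# Usage on the OSISM manager node, polling every 10 seconds:
#   until ceph status --format json | all_pgs_clean; do sleep 10; done
```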
Within a single node, it is safe to use the loop below without waiting between individual OSDs, provided your cluster meets both of the following conditions:
- At least 3 OSD nodes.
- Host-level CRUSH failure domain (the OSISM default). Verify with:
ceph osd crush rule dump
All rules should show "type": "host" in the chooseleaf step.
If your cluster has only 2 OSD nodes, or any rule shows a different failure domain type,
adopt each OSD individually and wait for all PGs to return to active+clean after each
one before continuing.
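The failure-domain check can be automated by scanning the rule dump for chooseleaf steps. This is a sketch under assumptions: the function name crush_rules_host_level is hypothetical, and the rule_name, steps, op and type keys are assumed to match the JSON that ceph osd crush rule dump prints. Rules without any chooseleaf step pass unnoticed and should be reviewed manually.

```shell
# Succeed only if every chooseleaf step in every CRUSH rule uses type "host".
# Reads the JSON from `ceph osd crush rule dump` on stdin and prints offenders.
crush_rules_host_level() {
    python3 -c '
import json, sys

bad = []
for rule in json.load(sys.stdin):
    for step in rule.get("steps", []):
        if step.get("op", "").startswith("chooseleaf") and step.get("type") != "host":
            bad.append("{}: {}".format(rule.get("rule_name", "?"), step.get("type")))
for line in bad:
    print(line)
sys.exit(1 if bad else 0)
'
}

# Usage on the OSISM manager node:
#   ceph osd crush rule dump | crush_rules_host_level
```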
Then adopt each OSD on the node:
OSD_ID=<osd_id>
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name osd.${OSD_ID}
sudo systemctl reset-failed ceph-osd@${OSD_ID}.service 2>/dev/null || true
The output should look similar to this:
Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
Found online OSD at //var/lib/ceph/osd/ceph-1/fsid
objectstore_type is bluestore
Stopping old systemd unit ceph-osd@1...
Disabling old systemd unit ceph-osd@1...
Moving data...
Chowning content...
Chowning /var/lib/ceph/11111111-1111-1111-1111-111111111111/osd.1/block...
Disabling host unit ceph-volume@ lvm unit...
Non-zero exit code 1 from systemctl disable ceph-volume@lvm-1-f3cfe0e4-70f3-4078-9aba-2d45170e9df9.service
systemctl: stderr Failed to disable unit: Unit file ceph-volume@lvm-1-f3cfe0e4-70f3-4078-9aba-2d45170e9df9.service does not exist.
Moving logs...
Creating new units...
Or use a loop to adopt all OSDs on the current node at once:
for osd_id in $(docker ps --filter "name=ceph-osd" --format "{{.Names}}" | sed 's/ceph-osd-//'); do
sudo cephadm --image ${CEPH_IMAGE} adopt --style legacy --skip-firewalld --name osd.${osd_id}
sudo systemctl reset-failed ceph-osd@${osd_id}.service 2>/dev/null || true
done
During OSD adoption you may see an error like:
Non-zero exit code 1 from systemctl disable ceph-volume@lvm-<OSD_ID>-<UUID>.service
Failed to disable unit: Unit file ceph-volume@lvm-<OSD_ID>-<UUID>.service does not exist.
This error is harmless. Cephadm attempts to disable the legacy ceph-volume systemd unit
as part of the adoption process. When Ceph was deployed with ceph-ansible using containers,
this unit does not exist, so the disable command fails. The OSD is still adopted correctly.
After all OSDs have been adopted, verify that the orchestrator recognises them:
ceph orch ls --refresh
The output should now also include the OSD service:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mgr 3/0 5m ago - <unmanaged>
mon 3/0 5m ago - <unmanaged>
osd 6 5m ago - <unmanaged>
Verify that all OSD daemon instances are running:
ceph orch ps --daemon-type osd --refresh
The output should look similar to this. All OSDs should show as running:
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
osd.0 testbed-node-3 running (46s) 0s ago - 181M 4096M 18.2.8 01985efead8e f83bb9204db5
osd.1 testbed-node-4 running (5m) 0s ago 2m 167M 4096M 18.2.8 01985efead8e 11cd8d77a78b
osd.2 testbed-node-5 running (118s) 0s ago - 176M 4096M 18.2.8 01985efead8e f608633171a8
osd.3 testbed-node-4 running (3m) 0s ago 2m 196M 4096M 18.2.8 01985efead8e 5032744c6063
osd.4 testbed-node-5 running (2m) 0s ago - 179M 4096M 18.2.8 01985efead8e a9c9f18801d9
osd.5 testbed-node-3 running (38s) 0s ago - 153M 4096M 18.2.8 01985efead8e bfd02ac996db
Once all PGs are active+clean, remove the safety flags and re-enable the PG
autoscaler on the OSISM manager node:
ceph osd unset noout
ceph osd unset nodeep-scrub
ceph balancer on
if [ -f /home/dragon/autoscale_pools.txt ]; then
  while read -r pool; do
    if ceph osd pool set "${pool}" pg_autoscale_mode on > /dev/null 2>&1; then
      echo "OK: ${pool}"
    else
      echo "FAILED: ${pool}"
    fi
  done < /home/dragon/autoscale_pools.txt
fi
# Once all pools show OK:, run: rm /home/dragon/autoscale_pools.txt
Migrating crash daemons
The crash daemons cannot be adopted from ceph-ansible and must be redeployed. Stop and disable the legacy crash units on each Ceph node:
CEPH_HOSTNAME=$(hostname)
sudo systemctl stop ceph-crash@${CEPH_HOSTNAME}.service
sudo systemctl disable ceph-crash@${CEPH_HOSTNAME}.service
Then deploy new crash daemons via cephadm on the OSISM manager node:
ceph orch apply crash
Verify that the crash daemons are running:
ceph orch ls
The output should now include the crash service:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
crash 6/6 3s ago 23s *
mgr 3/0 3s ago - <unmanaged>
mon 3/0 3s ago - <unmanaged>
osd 6 3s ago - <unmanaged>
Migrating RGW daemons
The migration of RGW has currently only been tested for single-site deployments. Instructions for multi-site RGW setups will be added in a future update of this guide.
RGW daemons cannot be adopted in-place with cephadm adopt. Instead, new RGW daemons
are deployed via the orchestrator and the legacy daemons are stopped afterwards.
Determine the RGW nodes on the OSISM manager node:
osism get hosts -l ceph-rgw
Determine the RGW realm, zone group, and zone name from the running cluster:
radosgw-admin realm list
radosgw-admin zonegroup list
radosgw-admin zone list
In a typical single-site OSISM deployment, the zone and zonegroup are default. The realm
may be empty — the script below falls back to default in that case, giving a service ID
of default.default.
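The fallback can be sketched as a small helper that joins realm and zone with a dot and substitutes default for an empty value. The function name rgw_service_id is hypothetical.

```shell
# Compose the RGW service ID as <realm>.<zone>, using "default" for
# an empty or missing realm or zone.
rgw_service_id() {
    echo "${1:-default}.${2:-default}"
}

# Usage:
#   RGW_SERVICE_ID=$(rgw_service_id "${RGW_REALM}" "${RGW_ZONE}")
```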
The service ID for the ceph orch apply rgw command is composed as
<realm_name>.<zone_name> (e.g. default.default).
Determine the RGW frontend port from the ceph-ansible configuration
(environments/ceph/configuration.yml). The variable radosgw_frontend_port contains
the port (default: 8081).
If the RGW service was configured with SSL (i.e. radosgw_frontend_ssl_certificate is
set in the ceph-ansible configuration), the SSL certificate must be imported into the
Ceph config-key store before deploying. Run this on each RGW node:
sudo ceph config-key set rgw/cert/rgw.$(hostname) -i <path_to_ssl_certificate>
Unlike MDS daemons, RGW daemons bind to a specific port. The legacy daemon on a node must be stopped before the new one can start, otherwise it will fail due to a port conflict. The migration is therefore performed one node at a time to minimize S3/Swift API downtime — the remaining RGW nodes continue serving requests while one node is being migrated.
Prepare the variables and deploy the RGW service on the OSISM manager node. The
orchestrator will attempt to start daemons on each node, but they will only come up
once the legacy daemon on that node has been stopped. If SSL is used, add --ssl to
the command:
RGW_REALM=$(radosgw-admin realm list --format json | python3 -c "import json,sys; print(json.load(sys.stdin)['realms'][0])" 2>/dev/null || echo "default")
RGW_ZONE=$(radosgw-admin zone list --format json | python3 -c "import json,sys; print(json.load(sys.stdin)['zones'][0])" 2>/dev/null || echo "default")
RGW_SERVICE_ID="${RGW_REALM}.${RGW_ZONE}"
RGW_PLACEMENT=$(osism get hosts -l ceph-rgw | awk 'NR>3 && /\|/ {print $2}' | paste -sd,)
RGW_PORT=$(python3 -c "import yaml; print(yaml.safe_load(open('/opt/configuration/environments/ceph/configuration.yml')).get('radosgw_frontend_port', 8081))")
ceph orch apply rgw ${RGW_SERVICE_ID} --placement="${RGW_PLACEMENT}" --port=${RGW_PORT}
# If SSL is enabled, add --ssl:
# ceph orch apply rgw ${RGW_SERVICE_ID} --placement="${RGW_PLACEMENT}" --port=${RGW_PORT} --ssl
Then migrate each RGW node sequentially. On the OSISM manager node:
for node in $(osism get hosts -l ceph-rgw | awk 'NR>3 && /\|/ {print $2}'); do
echo "Stopping legacy RGW daemon on ${node}..."
ssh ${node} "sudo systemctl stop ceph-radosgw.target; sudo systemctl disable ceph-radosgw.target"
echo "Waiting for new RGW daemon on ${node}..."
until ceph orch ps --daemon-type rgw --format json | python3 -c "
import json, sys
daemons = json.load(sys.stdin)
sys.exit(0 if any(d['hostname'] == '${node}' and d.get('status_desc') == 'running' for d in daemons) else 1)
" 2>/dev/null; do
sleep 5
done
echo "New RGW daemon on ${node} is running."
done
The output should look similar to this:
Stopping legacy RGW daemon on testbed-node-3...
Removed "/etc/systemd/system/multi-user.target.wants/ceph-radosgw.target".
Removed "/etc/systemd/system/ceph.target.wants/ceph-radosgw.target".
Waiting for new RGW daemon on testbed-node-3...
New RGW daemon on testbed-node-3 is running.
Stopping legacy RGW daemon on testbed-node-4...
Removed "/etc/systemd/system/multi-user.target.wants/ceph-radosgw.target".
Removed "/etc/systemd/system/ceph.target.wants/ceph-radosgw.target".
Waiting for new RGW daemon on testbed-node-4...
New RGW daemon on testbed-node-4 is running.
Stopping legacy RGW daemon on testbed-node-5...
Removed "/etc/systemd/system/multi-user.target.wants/ceph-radosgw.target".
Removed "/etc/systemd/system/ceph.target.wants/ceph-radosgw.target".
Waiting for new RGW daemon on testbed-node-5...
New RGW daemon on testbed-node-5 is running.
Verify that all new RGW daemons are running:
ceph orch ps --daemon-type rgw
The output should look similar to this:
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
rgw.default.default.testbed-node-3.dhmops testbed-node-3 *:8081 running (39s) 32s ago 39s 90.7M - 18.2.8 fb822dbe2ee6 d4b9a7c26ecd
rgw.default.default.testbed-node-4.ibjvjy testbed-node-4 *:8081 running (41s) 32s ago 41s 90.5M - 18.2.8 fb822dbe2ee6 fcf4d6c837f4
rgw.default.default.testbed-node-5.xcegdm testbed-node-5 *:8081 running (43s) 32s ago 43s 91.5M - 18.2.8 fb822dbe2ee6 569fd2afad81
Migrating MDS daemons
MDS daemons cannot be adopted in-place with cephadm adopt. Instead, new MDS daemons
are deployed via the orchestrator and the legacy daemons are stopped afterwards.
Determine the CephFS filesystem name on the OSISM manager node:
ceph fs ls
This returns output like name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ].
The name (e.g. cephfs) is needed for the next step.
The CephFS name also corresponds to the cephfs variable in the ceph-ansible
configuration (environments/ceph/configuration.yml).
Determine the MDS nodes on the OSISM manager node:
osism get hosts -l ceph-mds
The new MDS daemons deployed by the orchestrator come up as standby instances.
CephFS supports multiple MDS daemons simultaneously (controlled by max_mds), and new
daemons automatically join as standby. The legacy active daemons continue serving requests
until they are explicitly stopped, at which point CephFS promotes the standby daemons to
active. This brief coexistence of old and new daemons is expected and safe.
Deploy the new MDS daemons by running the following script on the OSISM manager node. The legacy daemons are stopped in a separate step afterwards:
CEPHFS_NAME=$(ceph fs ls --format json | python3 -c "import json,sys; print(json.load(sys.stdin)[0]['name'])")
MDS_PLACEMENT=$(osism get hosts -l ceph-mds | awk 'NR>3 && /\|/ {print $2}' | paste -sd,)
# Deploy new MDS daemons via the orchestrator
ceph orch apply mds ${CEPHFS_NAME} --placement="${MDS_PLACEMENT}"
# Wait until the new MDS daemons are running
echo "Waiting for new MDS daemons..."
until ceph orch ps --daemon-type mds --format json | python3 -c "
import json, sys
daemons = json.load(sys.stdin)
running = [d for d in daemons if d.get('status_desc') == 'running']
sys.exit(0 if len(running) >= len('${MDS_PLACEMENT}'.split(',')) else 1)
" 2>/dev/null; do
sleep 5
done
echo "New MDS daemons are running."
Verify that the new MDS daemons are running:
ceph orch ls --service-type mds
The output should look similar to this:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
mds.cephfs 3/3 0s ago 13s testbed-node-3;testbed-node-4;testbed-node-5
Then stop and disable the legacy MDS daemons. Run the following on the OSISM manager node:
for node in $(osism get hosts -l ceph-mds | awk 'NR>3 && /\|/ {print $2}'); do
echo "Stopping legacy MDS daemon on ${node}..."
ssh ${node} "sudo systemctl stop ceph-mds@${node}.service; sudo systemctl disable ceph-mds@${node}.service"
done
Step 6: Verify the migration
Verify the full cluster state on the OSISM manager node:
ceph -s
ceph orch ps
All daemons should appear in the ceph orch ps output with status running. Monitor
and manager services should show with a placement (managed). OSD daemons will still
show as <unmanaged> — this is expected (see the note in Step 6 on how to optionally
transition them to a managed state). Cluster health should be HEALTH_OK.
Verify that all daemons are running the same Ceph version (on the OSISM manager node):
ceph versions
The output should show a single version across all daemon types. If multiple versions appear for any daemon type, some daemons were not adopted with the correct container image.
{
"mon": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"mgr": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"osd": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 6
},
"mds": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"rgw": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 3
},
"overall": {
"ceph version 18.2.8 (efac5a54607c13fa50d4822e50242b86e6e446df) reef (stable)": 18
}
}
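The single-version check can be scripted against the JSON shown above by counting the entries under overall. This is a sketch; the function name single_ceph_version is hypothetical and python3 is assumed to be available.

```shell
# Succeed only if `ceph versions` reports exactly one version under "overall".
# Reads the JSON on stdin and prints each version with its daemon count.
single_ceph_version() {
    python3 -c '
import json, sys

overall = json.load(sys.stdin).get("overall", {})
for version, count in overall.items():
    print("{}: {}".format(count, version))
sys.exit(0 if len(overall) == 1 else 1)
'
}

# Usage on the OSISM manager node:
#   ceph versions | single_ceph_version
```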
Step 7: Clean up ceph-ansible artifacts
Only proceed with this step after completing the verification in
Step 6: ceph -s shows HEALTH_OK, all daemons
appear as running in ceph orch ps, and ceph versions reports a single consistent
version across all daemon types.
- Remove old systemd unit files that are no longer used. Run on each Ceph node. Do not use a wildcard like ceph-*.service as this would also remove the cephadm-managed units (e.g. ceph-<FSID>@.service). Remove only the legacy units:
sudo rm -f /etc/systemd/system/ceph-mon@.service \
  /etc/systemd/system/ceph-mgr@.service \
  /etc/systemd/system/ceph-osd@.service \
  /etc/systemd/system/ceph-mds@.service \
  /etc/systemd/system/ceph-radosgw@.service \
  /etc/systemd/system/ceph-crash@.service \
  /etc/systemd/system/ceph-mon.target \
  /etc/systemd/system/ceph-mgr.target \
  /etc/systemd/system/ceph-osd.target \
  /etc/systemd/system/ceph-mds.target \
  /etc/systemd/system/ceph-radosgw.target \
  /etc/systemd/system/ceph.target
sudo rm -rf /etc/systemd/system/ceph.target.wants
sudo systemctl daemon-reload
- The ceph-ansible configuration in the OSISM configuration repository (environments/ceph/) can be kept as a reference but is no longer active.
Do not remove /etc/ceph/ceph.conf or any keyring files. These are still
used by the cluster and are now managed by cephadm.
Post-migration operations
After migration, Ceph lifecycle operations are performed through cephadm and the
Ceph orchestrator instead of osism apply ceph-* commands.
| ceph-ansible (before) | cephadm (after) |
|---|---|
| osism apply ceph-mons | ceph orch apply mon |
| osism apply ceph-mgrs | ceph orch apply mgr |
| osism apply ceph-osds | ceph orch apply osd |
| osism apply ceph-rgws | ceph orch apply rgw |
| osism apply ceph-mdss | ceph orch apply mds |
| Editing configuration.yml | ceph config set <section> <key> <val> |
| osism apply ceph-rolling_update | ceph orch upgrade start --image <img> |
For the full cephadm operations reference, see the cephadm documentation.
Troubleshooting
Adoption fails with "daemon not found"
Ensure that the daemon is still running under the legacy systemd unit before attempting adoption. Check with:
sudo systemctl status ceph-<type>@<id>
OSD adoption fails with missing fsid or type file
In some containerized deployments, the files /var/lib/ceph/osd/ceph-<id>/fsid and
/var/lib/ceph/osd/ceph-<id>/type may not exist on the host. Cephadm requires these
files during adoption. Create them manually before retrying:
OSD_ID=<osd_id>
OSD_FSID=$(sudo ceph-volume lvm list ${OSD_ID} --format json | python3 -c "import json,sys; d=json.load(sys.stdin); print([e['tags']['ceph.osd_fsid'] for e in d['${OSD_ID}'] if e['type']=='block'][0])")
sudo mkdir -p /var/lib/ceph/osd/ceph-${OSD_ID}
echo "${OSD_FSID}" | sudo tee /var/lib/ceph/osd/ceph-${OSD_ID}/fsid
echo "bluestore" | sudo tee /var/lib/ceph/osd/ceph-${OSD_ID}/type
Container image pull fails
Verify that the node can reach the container registry. For OSISM environments:
sudo cephadm pull registry.osism.tech/osism/ceph-daemon:${CEPH_RELEASE}
If the pull fails, you may see an error like:
Pulling container image registry.osism.tech/osism/ceph-daemon:reef...
ERROR: failed to pull image: unable to pull image: Error initializing source docker://registry.osism.tech/osism/ceph-daemon:reef: error pinging docker registry registry.osism.tech: Get "https://registry.osism.tech/v2/": dial tcp 185.56.112.10:443: i/o timeout
This typically indicates a network connectivity issue. If the node is behind a proxy,
ensure the container runtime is configured to use it (e.g. via HTTP_PROXY/HTTPS_PROXY
in /etc/systemd/system/docker.service.d/proxy.conf or the equivalent for Podman).
Cluster health degrades during migration
Stop the migration and investigate. Common causes:
- A daemon was adopted on the wrong node.
- The container image version does not match the running Ceph version.
- Network connectivity issues between nodes.
Resolve the issue before continuing. An adopted daemon can be reverted by stopping the cephadm-managed unit and restarting the legacy systemd unit if it was preserved:
# Example: reverting an adopted OSD with ID 3
# Stop the cephadm-managed unit
sudo systemctl stop ceph-<FSID>@osd.3.service
sudo systemctl disable ceph-<FSID>@osd.3.service
# Restart the legacy systemd unit
sudo systemctl enable ceph-osd@3.service
sudo systemctl start ceph-osd@3.service
Replace <FSID> with the cluster's FSID, which can be found with ceph fsid. The same
pattern applies to other daemon types (replace osd.3 with e.g. mon.<hostname> or
mgr.<hostname>).
SSH connection issues
Cephadm requires SSH access to all nodes. Verify on the OSISM manager node:
ceph cephadm check-host <node>
Or use a loop to check all Ceph nodes at once:
for node in $(osism get hosts -l ceph | awk 'NR>3 && /\|/ {print $2}'); do
echo "=== ${node} ==="
ceph cephadm check-host ${node}
done
Ensure the cephadm SSH key has been distributed to all nodes (see Step 4).