DSM 2.2: Onboarding existing Postgres Databases into DSM

One feature that many of our customers have asked about is the ability to ingest existing Postgres databases into Data Services Manager (DSM). As part of DSM 2.2, our team has put together a step-by-step guide on how to achieve this using the logical replication feature available in Postgres. The guide is included here as part of the official DSM 2.2 documentation. I decided to build a stand-alone Postgres database and follow the instructions to get the database into Data Services Manager. I built a Postgres v15.10 database on a VM running Debian 12 (bookworm), and loaded a sample learning database (called chinook) just to have something to test against. This will be my source database. I also created a new Postgres v15.10 database in my DSM environment called cjh-pg-01. This will be my target database. You can check the version of your databases with the psql command “select VERSION();”.

Considerations: This process does not support the onboarding of databases that use extensions that are unsupported by DSM-provisioned Postgres databases. To check the list of extensions installed on your source database, use the psql command “select * from pg_extension;” or the shorthand “\dx” command. To see the list of supported extensions on a DSM-provisioned Postgres database, use the psql command “select * from pg_available_extensions;”. Ensure that the extensions used by your source database are supported on the target (DSM) database.
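
For example, to confirm that a particular source extension is available on the target, you can filter pg_available_extensions there. The extension names below are placeholders; substitute whatever “\dx” reported on your source. Any name missing from the result set is not available on the DSM-provisioned target:

cjh-pg-01=# select name, default_version from pg_available_extensions where name in ('pg_trgm', 'hstore');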

Even though Postgres logical replication is quite mature and battle-hardened at this point, a common question is whether the databases should be in some sort of maintenance mode, or whether connections to the source database should be disabled during the replication. The answer is that this is left up to the customer’s own judgement, but in my opinion it is safer to approach it as a maintenance task. Connections to the source (and target) database can be prevented through your firewall, or via the pg_hba.conf file (see the sketch after the list below). Disabling activity on the database is worth considering in the following cases:

  • If the source database experiences heavy writes. Logical replication uses more resources than physical replication, so some slowness or additional replication delay might be experienced under those conditions.
  • If the application performs DDL statements (create, drop, etc.) on the database. Those statements are not replicated as part of logical replication.
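
As a rough sketch of the pg_hba.conf approach mentioned above, the entries below reject application connections to the source database for the migration window while still allowing the replication connection from the target. The addresses are placeholders for your own environment, order matters (the first matching line wins), and the configuration must be reloaded (e.g. “select pg_reload_conf();”) for the changes to take effect:

# Allow the DSM target to connect for logical replication (placeholder address)
host    chinook    postgres    ip-address-of-target-db/32    scram-sha-256
# Reject all other client connections to chinook during the migration (placeholder subnet)
host    chinook    all         10.0.0.0/24                   reject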

We also strongly recommend that readers try out the migration process in a non-critical environment that resembles production as closely as possible before attempting the process on a production database.

Let’s see how to onboard this database into DSM using the instructions in the official documentation referenced above.

Step 1. Set wal_level to logical on source database

What is this, and why do we need to change it? wal_level determines how much information is written to the WAL (Write-Ahead Log). The default value is replica, which writes enough data to support WAL archiving and replication, including running read-only queries on a standby server. This is done via continuous archiving (archive_mode) and streaming binary replication. Changing wal_level to logical adds the information necessary to support “logical decoding”. The logical wal_level logs the same information as replica, but also includes the additional information needed to extract logical change sets from the WAL. Since this is exactly what we are going to do during the onboarding of the database, we need to set wal_level to logical. (More details on wal_level can be found here.)

Changing wal_level from replica to logical requires a restart of the database; it is not enough to just alter the system setting. Log in to the source database with superuser privileges and alter the wal_level.

chinook=# SHOW wal_level;
wal_level
-----------
replica
(1 row)

chinook=# ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM

chinook=# SHOW wal_level;
wal_level
-----------
replica
(1 row)

chinook=# \q
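
Before restarting, an optional check (not part of the official steps) confirms from psql that the new value has been recorded and is waiting on a restart. After reloading the configuration, pending_restart in pg_settings should show t for wal_level:

chinook=# select pg_reload_conf();
chinook=# select name, setting, pending_restart from pg_settings where name = 'wal_level';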

This should add the change to the postgresql.auto.conf configuration file (ALTER SYSTEM writes there, rather than to postgresql.conf). Now restart the Postgres database running on Debian 12 (bookworm) for the change to take effect. Again, this is my source database that I plan to import into a DSM database.

$ sudo systemctl restart postgresql
$ sudo systemctl status postgresql

● postgresql.service - PostgreSQL RDBMS
     Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; preset: enabled)
     Active: active (exited) since Tue 2025-02-11 13:46:51 GMT; 9s ago
    Process: 38355 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 38355 (code=exited, status=0/SUCCESS)
        CPU: 2ms

Feb 11 13:46:51 debian-pg systemd[1]: Starting postgresql.service - PostgreSQL RDBMS...
Feb 11 13:46:51 debian-pg systemd[1]: Finished postgresql.service - PostgreSQL RDBMS.

Now log back into the source database and verify the new wal_level.

chinook=# SHOW wal_level;
wal_level
-----------
logical
(1 row)

Step 2. Extract schema from source database

The next step is to extract the schema from the ‘chinook’ database. To do this, the pg_dump command is used. pg_dump is a utility for backing up a PostgreSQL database. The options used here are:

  • --schema-only: Dump only the object definitions (schema), not data.
  • --clean: Output commands to DROP all the dumped database objects prior to outputting the commands for creating them.
  • --if-exists: Use DROP … IF EXISTS commands to drop objects in --clean mode.
  • --no-privileges: Prevent dumping of access privileges (grant/revoke commands).
  • --no-subscriptions: Do not dump subscriptions. We will see how to add a subscription shortly.
  • --no-publications: Do not dump publications. We will see how to add a publication shortly.

More details on pg_dump are here. Be sure to examine the output of the pg_dump command for errors, even though they may be benign/harmless. Afterwards, you can check the output.log for details about how the schema will be recreated in the DSM target Postgres database. I ran this command while logged into the Debian VM as the postgres user. Note that no data is copied by this command, only the schema of the source database. The steps to recreate the schema will be stored in the file /tmp/output.log.

$ pg_dump "user=postgres password='source-password' dbname=chinook" \
--schema-only --clean --if-exists --no-privilege --no-subscriptions \
--no-publications > /tmp/output.log


more /tmp/output.log
--
-- PostgreSQL database dump
--

-- Dumped from database version 15.10 (Debian 15.10-1.pgdg120+1)
-- Dumped by pg_dump version 15.10 (Debian 15.10-1.pgdg120+1)

SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

ALTER TABLE IF EXISTS ONLY public.track DROP CONSTRAINT IF EXISTS track_media_type_id_fkey;
ALTER TABLE IF EXISTS ONLY public.track DROP CONSTRAINT IF EXISTS track_genre_id_fkey;
ALTER TABLE IF EXISTS ONLY public.track DROP CONSTRAINT IF EXISTS track_album_id_fkey;
ALTER TABLE IF EXISTS ONLY public.playlist_track DROP CONSTRAINT IF EXISTS playlist_track_track_id_fkey;
ALTER TABLE IF EXISTS ONLY public.playlist_track DROP CONSTRAINT IF EXISTS playlist_track_playlist_id_fkey;
ALTER TABLE IF EXISTS ONLY public.invoice_line DROP CONSTRAINT IF EXISTS invoice_line_track_id_fkey;
ALTER TABLE IF EXISTS ONLY public.invoice_line DROP CONSTRAINT IF EXISTS invoice_line_invoice_id_fkey;
ALTER TABLE IF EXISTS ONLY public.invoice DROP CONSTRAINT IF EXISTS invoice_customer_id_fkey;
.
.
<truncated>
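
As an optional sanity check of my own (not part of the official guide), you can count the CREATE TABLE statements captured in the dump and compare that number with the list of tables reported by “\dt” on the source:

$ grep -c 'CREATE TABLE' /tmp/output.log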

Step 3. Create schema on target (DSM) database

The next step in the process is to restore the schema from the backup of the source database to the DSM target database. This is done by simply connecting to the target database and redirecting the captured schema to the connection. Check for errors during this process, as this is where unsupported extensions will be reported.

$ psql postgresql://pgadmin:target-password@target-ip-address:5432/cjh-pg-01 \
< /tmp/output.log
SET
SET
SET
SET
SET
set_config
------------
(1 row)

SET
SET
SET
SET
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
.
.
<truncated>
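
By default, psql keeps going after an error, which makes problems easy to miss in long output. If you would prefer the restore to halt at the first error, the same command can be run with psql’s ON_ERROR_STOP variable set (a variation of my own, not something the official guide requires). Bear in mind the earlier caveat: some reported errors are benign, and ON_ERROR_STOP will halt on those too.

$ psql -v ON_ERROR_STOP=1 \
postgresql://pgadmin:target-password@target-ip-address:5432/cjh-pg-01 \
< /tmp/output.log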

This should result in the schema from the “chinook” database being reproduced on the “cjh-pg-01” database. Let’s check the schema on the target. The “\d” command in Postgres will show the tables, and they should match the schema of the original database.

$ psql postgresql://pgadmin:target-password@target-ip-address:5432/cjh-pg-01
psql (15.10 (Debian 15.10-1.pgdg120+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

cjh-pg-01=# \d
                List of relations
Schema |       Name       |   Type   |  Owner   
--------+------------------+----------+----------
public | album            | table    | postgres
public | artist           | table    | postgres
public | customer         | table    | postgres
public | documents        | table    | postgres
public | documents_id_seq | sequence | postgres
public | employee         | table    | postgres
public | genre            | table    | postgres
public | invoice          | table    | postgres
public | invoice_line     | table    | postgres
public | media_type       | table    | postgres
public | playlist         | table    | postgres
public | playlist_track   | table    | postgres
public | track            | table    | postgres
(13 rows)

This looks good. Now I can proceed with copying the data.

Step 4. Create PUBLICATION on the source database

Now to sync the data using the publication and subscription method. A publication is a set of changes generated from a table or a group of tables, and might also be described as a change set or replication set. Publications may currently only contain tables, or all tables in a schema. Objects must be added explicitly, except when a publication is created for ALL TABLES. More details on publications are available here. ALL TABLES is used as part of this process for DSM database ingestion, so the publication must be added to the source database as shown below. The publication is called onboard_to_dsm. This change set or replication set will hold changes made to the tables in this database.

chinook=# create publication onboard_to_dsm for all tables;
CREATE PUBLICATION
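
To confirm that the publication exists and covers all tables, you can query the pg_publication catalog on the source; puballtables should show t for onboard_to_dsm:

chinook=# select pubname, puballtables from pg_publication;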

Step 5. Create SUBSCRIPTION on target (DSM) database

With the publication in place, we can now add a subscription. A subscription is the downstream side of logical replication (to a publication). The subscription is called onboard_to_cjh_pg_01. Each subscription will receive changes via one replication slot. The slot is created as part of the subscription. The slot name in this example is also onboard_to_cjh_pg_01. The schema definitions are not replicated, which is why we had to manually create the tables using a backup from the source database earlier. Note that the published tables must exist on the subscriber before a subscription can be created.

cjh-pg-01=# CREATE SUBSCRIPTION onboard_to_cjh_pg_01
cjh-pg-01-# CONNECTION 'dbname=chinook host=ip-address-of-source-db port=5432 user=postgres password=source-password sslmode=prefer'
cjh-pg-01-# PUBLICATION onboard_to_dsm
cjh-pg-01-# WITH (create_slot = true, slot_name = onboard_to_cjh_pg_01, enabled = true);
NOTICE:  created replication slot "onboard_to_cjh_pg_01" on publisher
CREATE SUBSCRIPTION
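
Optionally, you can confirm both sides at this point (my own extra check): the subscription on the target via the pg_subscription catalog, and the replication slot it created on the source via pg_replication_slots:

cjh-pg-01=# select subname, subenabled from pg_subscription;

chinook=# select slot_name, plugin, active from pg_replication_slots;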

Step 6. Check the replication status on target database

The official documentation recommends using various catalogs, such as pg_subscription_rel, pg_class and pg_stat_activity, to check on the replication status. The catalog pg_subscription_rel contains the state for each replicated relation in each subscription. The catalog pg_class describes tables and other objects that have columns or are otherwise similar to a table. The catalog pg_stat_activity can report details about the process ID, active user, currently running query, activity status, time the last query started, etc.

The field we are most interested in is the subscription state.  The following are the state codes:

  • i = initialize
  • d = data is being copied
  • f = finished table copy
  • s = synchronized
  • r = ready (normal replication)

The state can also be checked via the pg_stat_activity catalog, by examining the logical replication apply processes (more details here). An idle state implies that there are no outstanding changes to apply. Together, these statuses suggest that the data has been synchronised/replicated.

cjh-pg-01=# select srsubid, class.relname, srsubstate from pg_subscription_rel
cjh-pg-01-# join pg_class class on oid = srrelid;
  138080 | employee       | r
  138080 | customer       | r
  138080 | invoice        | r
  138080 | invoice_line   | r
  138080 | media_type     | r
  138080 | track          | r
  138080 | playlist       | r
  138080 | genre          | r
  138080 | playlist_track | r
  138080 | artist         | r
  138080 | album          | r


cjh-pg-01=# select datname,state from pg_stat_activity where wait_event = 'LogicalApplyMain';
cjh-pg-01 | idle
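
With a larger schema it can be tedious to eyeball the per-table states, so a simple aggregate (my own shorthand, not from the official guide) is to count the relations that have not yet reached the ready state; once this returns 0, the initial synchronisation is complete:

cjh-pg-01=# select count(*) from pg_subscription_rel where srsubstate <> 'r';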

Step 7. Check the data on the target database

Check on the destination (target) database that the data has been synchronised. Here is a quick check on the contents of one of my tables.

cjh-pg-01=# select * from artist where name LIKE '%Zappa%';
        23 | Frank Zappa & Captain Beefheart

Nice! Looks like my data is present. You will probably want to do some additional testing on your database to validate that everything has indeed been replicated.
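
One simple approach is to compare per-table row estimates on the source and the target. Note that n_live_tup is only an estimate maintained by the statistics system (a fresh ANALYZE improves its accuracy), so use count(*) per table if you need exact figures. Run the following on both databases and compare the results:

cjh-pg-01=# select relname, n_live_tup from pg_stat_user_tables order by relname;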

Step 8. Tidy Up. Drop the PUBLICATION & SUBSCRIPTION

Remove the subscription from the target database and the publication from the source database as they are no longer needed.

cjh-pg-01=# drop subscription if exists onboard_to_cjh_pg_01;
NOTICE:  dropped replication slot "onboard_to_cjh_pg_01" on publisher
DROP SUBSCRIPTION
chinook=# drop publication onboard_to_dsm;
DROP PUBLICATION
chinook=#
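
As a final hygiene check of my own, verify on the source that the replication slot really was removed; an orphaned slot prevents WAL files from being recycled and can eventually fill the disk. An empty result (assuming no other slots are in use) confirms the cleanup:

chinook=# select slot_name, active from pg_replication_slots;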

Congratulations! You have now successfully onboarded your Postgres database into DSM. You can now leverage all of the capabilities available in DSM, such as automatic backups, automated patching of minor versions, and major version upgrades. You can also adjust the topology of the database to include a cross-cluster topology for availability, or even replicate the database to a secondary database for disaster recovery purposes.
