Getting Started with Data Services Manager 2.0 – Part 8: DSM Appliance Restore

A common question I get asked when giving talks on Data Services Manager (DSM) version 2.0 is “what happens to the databases and data services when the DSM appliance / provider has an outage?” The simple answer is that nothing happens to your data services or databases – they continue to run as before. Of course, without the DSM appliance, you do not have access to the UI to manage and monitor the databases, nor do you have access to the gateway API, so it won’t be possible to provision new data services. The next obvious question, then, is how to restore a DSM appliance should it suffer some sort of unrecoverable issue. That is what we are going to discuss in this post. Note that we typically refer to the DSM appliance as the provider, so you may find the terms used interchangeably throughout.

Note: At the time of writing, DSM version 2.0 is not yet generally available. Therefore some of the screenshots and command line outputs captured in this post may differ slightly in the final product. However, for the purposes of our getting started series of posts, this should not matter too much.

Overview

The sequence of steps involved in restoring the Data Services Manager appliance can be summarised as follows:

  1. Where possible, retrieve the /data/pgbackrest.conf file from the original DSM provider. This file contains details about the provider’s backup configuration, including the S3 object store, bucket, S3 credentials and file path to the backup. If it cannot be retrieved, it can be manually created on the new DSM provider.
  2. Power off the original DSM appliance.
  3. Remove the DSM plugin from the vCenter server via the vSphere Client > Administration > Solutions > Client Plugins > Data Services Manager Plugin > Remove.
  4. Deploy a new DSM appliance, using the same version as the original appliance. Ensure the configuration details of the new OVA match those of the original DSM OVA (e.g. networking).
  5. Log in to the new DSM appliance and set up the pgbackrest.conf so that the restore tool knows where to retrieve the provider backup from.
  6. Initiate the restore of the DSM appliance/provider using the restore-provider tool and the pgbackrest.conf configuration.

Unregister the DSM Plugin using docker

Step 3 above mentions that the DSM plugin for the original DSM provider must be removed from the vCenter server. You can do this simply via the UI, but I also wanted to show how to do it using a docker command from the provider command line. To run this docker command, you must log in to the original DSM appliance. The command requires a number of configuration items, including the vCenter server SHA-256 thumbprint (-vct). The vCenter thumbprint can be found by clicking on the icon immediately before the URL in the web browser connected to the vSphere Client. Click through to the certificate viewer to bring up the certificate details. From here you can find the SHA-256 fingerprint/thumbprint for the vCenter server. I have also included the -insecure flag in this command since my vCenter server is using self-signed certs rather than a custom certificate. Note that the version used in extension-registration:<version> will be different in your environment since this example is using pre-GA code. You can get the version from the docker images command.

# docker images | grep extension-reg
extension-registration                      2.0.0-23127626      4c067666dd38  8 days ago      179MB


# docker run --name extension_registration \
--rm extension-registration:2.0.0-23127626 \
-action unregisterPlugin \
-insecure \
-url https://<vCenter IP or FQDN>/sdk \
-vct 17:8e:93:56:a0:f6:8c:53:22:3d:ea:bf:aa:42:f8:f1:34:32:04:7c:89:02:76:2f:80:2d:61:ba:e4:77:8f:70 \
-username 'administrator@vsphere.local' \
-password '********' \
-key com.vmware.dsm.plugin
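
If you prefer to stay on the command line rather than use the browser, the SHA-256 thumbprint can also be retrieved with openssl. This is just a sketch, assuming the appliance can reach the vCenter server on port 443 (substitute your own vCenter FQDN or IP):

# echo | openssl s_client -connect <vCenter IP or FQDN>:443 2>/dev/null \
| openssl x509 -noout -fingerprint -sha256

Strip the “Fingerprint=” prefix from the output before passing the value to the -vct parameter.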

Configure pgbackrest.conf

Now that the plugin has been removed successfully from the vCenter server, we can proceed with deploying a new DSM appliance, and then initiate the restore of the original DSM provider content to this new DSM provider. The first step is to configure the pgbackrest.conf file found in /data on the new DSM appliance. There will already be a pgbackrest.conf template in the /data directory when you log in to the appliance. It needs to be configured to point to the S3-compliant object store that you previously configured for provider backups, along with the bucket name and access credentials. Here is an example from my environment:

[global]
repo1-path=/provider-backups-5ce48c2a-82b3-49c0-a96d-faa0bf17d612
repo1-type=s3
repo1-s3-endpoint=https://192.168.0.1:9000
repo1-s3-bucket=provider-backup
repo1-s3-uri-style=path
repo1-s3-verify-tls=n
repo1-s3-key=admin
repo1-s3-key-secret=password
repo1-s3-region=us-east-1

repo1-retention-full=7
process-max=2
log-level-console=info
log-level-file=error
start-fast=y
delta=y

[main]
pg1-path=/data/vpgsql

As mentioned, if you had access to the original DSM appliance beforehand, then you could retrieve the /data/pgbackrest.conf from there and copy it to this new DSM appliance. When the provider backup is configured on a DSM appliance, it is the pgbackrest.conf that holds all of the details relating to the provider backups.
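
For example, assuming root SSH access is enabled on the appliances, the file could be copied off to a workstation before the original appliance is powered off, and then pushed to the new appliance once it has been deployed (the addresses here are placeholders):

# scp root@<original DSM IP or FQDN>:/data/pgbackrest.conf .
# scp ./pgbackrest.conf root@<new DSM IP or FQDN>:/data/pgbackrest.conf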

If you are building this file from scratch, you will have to figure out what the repo1-path is. To find the correct repo1-path, navigate to the object storage interface, locate the repo1-s3-bucket used for provider backups, and then check the name of the folder underneath. If there are multiple folders, navigate into the backup/main sub-folder of each and check the date and time on the backup.info file. The folder containing the most recent backup is the one that should be set in the repo1-path field. In my case, on my MinIO Object Storage system, the correct folder was easy to identify, as its most recent backups had just been taken at midnight.
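
If you would rather check this from a shell than click through the object storage UI, the MinIO client (mc) can do the same job. This is just a sketch, assuming mc is available on a workstation and reusing the endpoint and credentials from the pgbackrest.conf above (--insecure matches the repo1-s3-verify-tls=n setting):

# mc --insecure alias set dsm-backups https://192.168.0.1:9000 admin password
# mc --insecure ls dsm-backups/provider-backup/
# mc --insecure stat dsm-backups/provider-backup/<folder>/backup/main/backup.info

The folder whose backup.info was most recently modified is the one to set as repo1-path. Once the file is filled in, running pgbackrest --config=/data/pgbackrest.conf --stanza=main info should list the available provider backups, which is a handy sanity check before attempting the restore (assuming the pgbackrest client is present on the appliance, which the restore tooling suggests it is).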

Run restore-provider

Using the configured pgbackrest.conf, the restore-provider tool can now be run to restore the original DSM provider contents to the newly provisioned provider. There is a lot of output, which I have truncated below. In my experience, the restore process usually takes approximately 8 to 10 minutes to complete.

# restore-provider -c /data/pgbackrest.conf
  .   ____          _            __ _ _
/\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
\\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
=========|_|==============|___/=/_/_/_/
:: Spring Boot ::               (v2.7.13)
 
2024-01-08 12:04:15.799  INFO [main           ] p.r.ProviderRestoreApplication - Starting ProviderRestoreApplication v2.0.0 using Java 11.0.20.1 on dsm-provider.rainpole.com with PID 8208 (/opt/vmware/tdm-provider/restore-service/bin/tdm-sp-provider-restore.jar started by root in /opt/vmware/tdm-provider/restore-service)
2024-01-08 12:04:15.804 DEBUG [main           ] p.r.ProviderRestoreApplication - Running with Spring Boot v2.7.13, Spring v5.3.28
2024-01-08 12:04:15.804  INFO [main           ] p.r.ProviderRestoreApplication - The following 1 profile is active: "restore"
2024-01-08 12:04:17.452  INFO [main           ] epositoryConfigurationDelegate - Bootstrapping Spring Data JPA repositories in DEFAULT mode.
2024-01-08 12:04:18.563  INFO [main           ] epositoryConfigurationDelegate - Finished Spring Data repository scanning in 1096 ms. Found 98 JPA repository interfaces.
2024-01-08 12:04:20.349  INFO [main           ] o.s.b.w.e.t.TomcatWebServer    - Tomcat initialized with port(s): 9999 (http)
2024-01-08 12:04:20.368  INFO [main           ] o.a.c.core.StandardService     - Starting service [Tomcat]
2024-01-08 12:04:20.368  INFO [main           ] o.a.c.core.StandardEngine      - Starting Servlet engine: [Apache Tomcat/9.0.76]
2024-01-08 12:04:20.483  INFO [main           ] o.a.c.c.C.[.[localhost].[/]    - Initializing Spring embedded WebApplicationContext
2024-01-08 12:04:20.484  INFO [main           ] letWebServerApplicationContext - Root WebApplicationContext: initialization completed in 4579 ms
2024-01-08 12:04:20.702  INFO [main           ] c.v.t.s.c.DataSourceConfig     - jdbc:postgresql:vmware
2024-01-08 12:04:20.706 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __INIT__setDatasourceConfig - key=spring.datasource.hikari.maximum-pool-size
2024-01-08 12:04:20.706  INFO [main           ] c.v.t.s.c.DataSourceConfig     - set SpringConfig: key=spring.datasource.hikari.maximum-pool-size, value=5
2024-01-08 12:04:20.706 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __DONE__setDatasourceConfig - key=spring.datasource.hikari.maximum-pool-size
2024-01-08 12:04:20.707 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __INIT__setDatasourceConfig - key=spring.datasource.hikari.connectionTimeout
2024-01-08 12:04:20.707  INFO [main           ] c.v.t.s.c.DataSourceConfig     - set SpringConfig: key=spring.datasource.hikari.connectionTimeout, value=20000
2024-01-08 12:04:20.707 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __DONE__setDatasourceConfig - key=spring.datasource.hikari.connectionTimeout
2024-01-08 12:04:20.707 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __INIT__setDatasourceConfig - key=spring.datasource.hikari.minimum-idle
2024-01-08 12:04:20.708  INFO [main           ] c.v.t.s.c.DataSourceConfig     - set SpringConfig: key=spring.datasource.hikari.minimum-idle, value=5
2024-01-08 12:04:20.708 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __DONE__setDatasourceConfig - key=spring.datasource.hikari.minimum-idle
2024-01-08 12:04:20.708 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __INIT__setDatasourceConfig - key=spring.datasource.hikari.idle-timeout
2024-01-08 12:04:20.708  INFO [main           ] c.v.t.s.c.DataSourceConfig     - set SpringConfig: key=spring.datasource.hikari.idle-timeout, value=60000
2024-01-08 12:04:20.708 DEBUG [main           ] c.v.t.s.c.DataSourceConfig     - __DONE__setDatasourceConfig - key=spring.datasource.hikari.idle-timeout
2024-01-08 12:04:20.719  WARN [main           ] com.zaxxer.hikari.HikariConfig - StandaloneHikariPool - idleTimeout has been set but has no effect because the pool is operating as a fixed size pool.
2024-01-08 12:04:20.721  INFO [main           ] c.z.hikari.HikariDataSource    - StandaloneHikariPool - Starting...
2024-01-08 12:04:20.861  INFO [main           ] c.z.hikari.HikariDataSource    - StandaloneHikariPool - Start completed.
2024-01-08 12:04:21.364  INFO [main           ] o.h.j.internal.util.LogHelper  - HHH000204: Processing PersistenceUnitInfo [name: default]
2024-01-08 12:04:21.488  INFO [main           ] org.hibernate.Version          - HHH000412: Hibernate ORM core version 5.6.15.Final
2024-01-08 12:04:21.776  INFO [main           ] o.h.annotations.common.Version - HCANN000001: Hibernate Commons Annotations {5.1.2.Final}
2024-01-08 12:04:22.052  INFO [main           ] org.hibernate.dialect.Dialect  - HHH000400: Using dialect: org.hibernate.dialect.PostgreSQLDialect
2024-01-08 12:04:22.540  WARN [main           ] Hibernate Types                - You should use Hypersistence Optimizer to speed up your Hibernate application!
2024-01-08 12:04:22.541  WARN [main           ] Hibernate Types                - For more details, go to https://vladmihalcea.com/hypersistence-optimizer/
2024-01-08 12:04:22.541  INFO [main           ] Hibernate Types                -
_    _                           _     _
| |  | |                         (_)   | |
| |__| |_   _ _ __   ___ _ __ ___ _ ___| |_ ___ _ __   ___ ___
|  __  | | | | '_ \ / _ \ '__/ __| / __| __/ _ \ '_ \ / __/ _ \
| |  | | |_| | |_) |  __/ |  \__ \ \__ \ ||  __/ | | | (_|  __/
|_|  |_|\__, | .__/ \___|_|  |___/_|___/\__\___|_| |_|\___\___|
         __/ | |
        |___/|_|
 
           ____        _   _           _
          / __ \      | | (_)         (_)
         | |  | |_ __ | |_ _ _ __ ___  _ _______ _ __
         | |  | | '_ \| __| | '_ ` _ \| |_  / _ \ '__|
         | |__| | |_) | |_| | | | | | | |/ /  __/ |
          \____/| .__/ \__|_|_| |_| |_|_/___\___|_|
                | |
                |_|

2024-01-08 12:04:22.542  INFO [main           ] Hibernate Types                - Check out the README page for more info about the Hypersistence Optimizer banner https://github.com/vladmihalcea/hibernate-types#how-to-remove-the-hypersistence-optimizer-banner-from-the-log
2024-01-08 12:04:24.421  INFO [main           ] e.t.j.p.i.JtaPlatformInitiator - HHH000490: Using JtaPlatform implementation: [org.hibernate.engine.transaction.jta.platform.internal.NoJtaPlatform]
2024-01-08 12:04:24.435  INFO [main           ] tainerEntityManagerFactoryBean - Initialized JPA EntityManagerFactory for persistence unit 'default'
2024-01-08 12:04:25.780  INFO [main           ] rgetType$TargetFetcherInjector - Injecting targetFetcher : SERVICE_INSTANCE
2024-01-08 12:04:25.780  INFO [main           ] rgetType$TargetFetcherInjector - Injecting targetFetcher : TASK
2024-01-08 12:04:25.782  INFO [main           ] rgetType$TargetFetcherInjector - Injecting targetFetcher : S3_STORAGE
2024-01-08 12:04:25.782  INFO [main           ] rgetType$TargetFetcherInjector - Injecting targetFetcher : PROVIDER_ENVIRONMENT
2024-01-08 12:04:25.782  INFO [main           ] rgetType$TargetFetcherInjector - Injecting targetFetcher : SERVICE_INSTANCE_GROUP
2024-01-08 12:04:25.827  INFO [main           ] c.v.t.c.w.c.SSLConfig          - Setting java truststore to: /opt/vmware/tdm-provider/cert/truststore.jks
.
.
.
2024-01-08 12:11:34.095 P00   INFO: execute non-exclusive backup start: backup begins after the requested immediate checkpoint completes
2024-01-08 12:11:35.596 P00   INFO: backup start archive = 0000000200000000000000A8, lsn = 0/A8000028
2024-01-08 12:11:35.596 P00   INFO: check archive for prior segment 0000000200000000000000A7
WARN: a timeline switch has occurred since the 20240104-123916F_20240108-000001I backup, enabling delta checksum
      HINT: this is normal after restoring from backup or promoting a standby.
2024-01-08 12:11:44.690 P00   INFO: execute non-exclusive backup stop and wait for all WAL segments to archive
2024-01-08 12:11:45.391 P00   INFO: backup stop archive = 0000000200000000000000A8, lsn = 0/A802C8A0
2024-01-08 12:11:45.428 P00   INFO: check archive for segment(s) 0000000200000000000000A8:0000000200000000000000A8
2024-01-08 12:11:45.491 P00   INFO: new backup label = 20240104-123916F_20240108-121133I
2024-01-08 12:11:45.916 P00   INFO: incr backup size = 168.6MB, file total = 2163
2024-01-08 12:11:45.916 P00   INFO: backup command end: completed successfully (12983ms)
2024-01-08 12:11:47.841  INFO [pool-4-thread-1] c.v.t.c.tools.CommandRunner    - Result ExitCode: 0
2024-01-08 12:11:47.842 DEBUG [pool-4-thread-1] b.s.StateTriggerProviderBackup - ___DONE___stateTriggerProviderBackup___PROVIDER_UPDATE
2024-01-08 12:11:47.842  INFO [pool-4-thread-1] c.v.t.c.t.w.SimpleWorkflowFSM  -  DONE FSM execution <<
2024-01-08 12:11:47.843 DEBUG [pool-4-thread-1] c.v.t.c.t.w.BaseSimpleWorkLoad - __DONE____________workflow {} class com.vmware.tdm.sp.provider.common.backup.workflow.ProviderBackupWorkLoad
2024-01-08 12:12:01.690 DEBUG [main           ] ProviderBackupTaskAsyncMonitor - ___RETRY Fetch Provider backup task status. TaskId - [1] Status - [SUCCESS]
2024-01-08 12:12:01.693  INFO [main           ] v.t.s.p.c.a.AuditActionHandler - __INIT__onSuccess. Source - PROVIDER_BACKUP, OperationType - {}
2024-01-08 12:12:01.694  INFO [main           ] v.t.s.p.c.a.AuditActionHandler - __DONE__onSuccess PROVIDER_BACKUP
2024-01-08 12:12:01.699  INFO [audit-exec-2   ] c.v.t.s.p.c.a.AuditProcessor   - AuditEntrySaved: AuditEntity(id=22e54a04-9814-45ad-97fc-649e7ba950ee, source=System, component=PROVIDER, operationType=PROVIDER_BACKUP, subject=null, details=Provider Backup Success, eventTime=2024-01-08 12:12:01.693, result=OK)
2024-01-08 12:12:01.700 DEBUG [main           ] ProviderBackupTaskAsyncMonitor - ___DONE__monitorBackupTask___TaskId - [1]
2024-01-08 12:12:01.700  INFO [main           ] .s.p.c.b.ProviderBackupService - ___DONE__createProviderBackup___
2024-01-08 12:12:01.700  INFO [main           ] c.v.t.s.p.r.c.RestoreDbCommand - __DONE__Saving provider backup settings to Recovered Provider
2024-01-08 12:12:01.710  INFO [main           ] o.a.c.core.StandardService     - Stopping service [Tomcat]
2024-01-08 12:12:01.719  WARN [main           ] o.a.c.l.WebappClassLoaderBase  - The web application [ROOT] appears to have started a thread named [StandaloneHikariPool housekeeper] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
java.base@11.0.20.1/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.20.1/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.base@11.0.20.1/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
java.base@11.0.20.1/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.base@11.0.20.1/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.base@11.0.20.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
java.base@11.0.20.1/java.lang.Thread.run(Unknown Source)
2024-01-08 12:12:01.792  INFO [main           ] tainerEntityManagerFactoryBean - Closing JPA EntityManagerFactory for persistence unit 'default'

And there you have it. In just a couple of steps, the DSM appliance/provider is restored back to where it was previously, with all of the infrastructure policy information, user/permission information, and data service and database information available once again. The restore also brings back all configuration settings that were previously put in place, such as Log Forwarding, SMTP configuration, webhook configuration and LDAP settings. The dashboard view should look exactly the same on the new provider as it did on the original provider.

Register the DSM Plugin using docker

For completeness, I also wanted to show how to register the DSM plugin with vCenter using the docker CLI, even though this step is not necessary in the restore process. However, you might need to do this if the IP address of the DSM appliance changes, or if a new certificate is installed on the DSM appliance.

The DSM thumbprint (in the -serverThumbprint field) must be SHA-1 in DSM 2.0, and the fields must be ‘:’ separated. This is a nuance of the vCenter API. The easiest way that I have found to retrieve the thumbprint in this ‘:’ separated format is via the Firefox browser. Similar to how the vCenter thumbprint was retrieved, click on the icon immediately before the DSM URL in the browser. Click on Connection > More information > View Certificate and from there you can retrieve the SHA-1 thumbprint.
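
As with the vCenter thumbprint, openssl offers a browser-free alternative and already prints fingerprints in the ‘:’ separated format. Again, this is only a sketch, assuming the DSM appliance serves its certificate on port 443:

# echo | openssl s_client -connect <IP of Provider Appliance>:443 2>/dev/null \
| openssl x509 -noout -fingerprint -sha1

As before, drop the “Fingerprint=” prefix from the output before using the value in the -serverThumbprint field.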

Note that the docker command to register the DSM plugin also includes the -insecure flag, since my vCenter server is using self-signed certs rather than a custom certificate. And as already highlighted, the version used in extension-registration:<version> and the -version field may vary depending on the DSM version in use. The DSM version is displayed in the “Version & Upgrade” section of the DSM UI.

# docker run \
--name extension_registration \
--rm extension-registration:2.0.0-23127626 \
-action registerPlugin \
-remote \
-insecure \
-url https://vcsa-06.rainpole.com/sdk \
-vct 17:8e:93:56:a0:f6:8c:53:22:3d:ea:bf:aa:42:f8:f1:34:32:04:7c:89:02:76:2f:80:2d:61:ba:e4:77:8f:70 \
-username 'administrator@vsphere.local' \
-password '*******' \
-key com.vmware.dsm.plugin \
-version 2.0.0.3730 \
-pluginUrl https://<IP of Provider Appliance>/provider/plugin/plugin_signed.zip \
-serverThumbprint DB:02:BD:A8:19:10:4B:2B:86:39:69:D7:C6:B1:26:FF:A9:E5:B0:29 \
-c 'VMware, Inc.' \
-n 'Data Services Manager Plugin' \
-s 'DSM solution appliance with remote plugin'

That completes the post on restoring the Data Services Manager appliance. Thanks for reading this far. Check out my other posts on DSM 2.0.