Physical replication in PostgreSQL 10 on GNU/Linux

Some time ago I had to install and setup a PostgreSQL cluster for a very critical application, an Warehouse Management System (WMS) developed internally for an ecommerce startup in which I worked.

I would like to share the steps I followed back in those days to setup the proper configuration, I have to say that I was a lucky guy because the Dev Lead who managed the Dev Team in this case he had already worked with PostgreSQL in the past, so our meetings to define Dev Team requirements were pretty easy.

Some context

From our meetings, I had to understand what were the specific requirements needed for such a specific peace of software as a WMS is, in a regular situation this software is provided by an external Warehouse company, but in this case the relationship with this startup was so good with Warehouse company that they reached an agreement to include a very custom WMS inside the Warehouse provider infrastructure, which it’s not very common situation.

Apart from that the requirements for the PostgreSQL cluster were pretty clear:

  • Not expecting a high load we only needed two nodes, one master node and one replica in standby mode
  • Regarding replication strategy we decided to use an Asynchronous streaming replication

Setting up the MASTER

Create archive directory:

db01:/etc/postgresql/10/main# mkdir /opt/postgresql/10/main/archive
db01:/etc/postgresql/10/main# chown postgres:postgres /opt/postgresql/10/main/archive

Edit postgresql.conf configuration file:

...
listen_addresses = '*'         # what IP address(es) to listen on;
wal_level = hot_standby        # minimal, archive, hot_standby, or logical
archive_mode = on              # enables archiving; off, on, or always
archive_command = 'cp %p /opt/postgresql/10/main/archive/%f'	  
max_wal_senders = 5            # max number of walsender processes
wal_keep_segments = 32         # in logfile segments, 16MB each; 0 disables
hot_standby = on               # "on" allows queries during recovery
...

Setting up the REPLICA

The configuration file postgresql.conf was the same as on the master node in our case. It was needed to create an archive directory:

db02:/etc/postgresql/10/main# mkdir /opt/postgresql/10/main/archive
db02:/etc/postgresql/10/main# chown postgres:postgres /opt/postgresql/10/main/archive

It was also required to create a recovery.conf file with this content in our DATA DIRECTORY:

# This tells the slave to keep pulling WALs from master.
standby_mode = on

# This is how to connect to the master.
primary_conninfo = 'host=10.240.0.XX user=repl password=YYYYY'

trigger_file = '/opt/postgresql/10/main/archive/failover_db_trigger'

restore_command = 'cp /opt/postgresql/10/main/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /opt/postgresql/10/main/archive %r'

As an additional setup I included an immutability check for this configuration file:

db02:/opt/postgresql/10/main# chattr +i recovery.conf 

Verification

To verify that everything was working from replication perspective we made some tests:

MASTER

postgres=# create database foo;
CREATE DATABASE

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 foo       | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)

STANDBY SERVER

postgres=# create database foo;
ERROR:  cannot execute CREATE DATABASE in a read-only transaction

To verify replication status:
# sudo -u postgres psql -c "select * from pg_stat_replication";

Troubleshooting

If you face some issues as we did something similar to this:

2018-05-22 09:34:54.839 UTC [9724] LOG:  record with incorrect prev-link 0/70013E8 at 0/7001420
cp: cannot stat '/opt/postgresql/10/main/archive/000000010000000000000007': No such file or directory
2018-05-22 09:34:59.844 UTC [9724] LOG:  record with incorrect prev-link 0/70013E8 at 0/7001420
cp: cannot stat '/opt/postgresql/10/main/archive/000000010000000000000007': No such file or directory
2018-05-22 09:35:04.851 UTC [9724] LOG:  record with incorrect prev-link 0/70013E8 at 0/7001420

We followed these steps to fix the issue (note 22960 port was a custom port where SSH server was running):

0. Stop postgresql on standby 
standby# systemctl stop  postgresql 

1. invoke pg_start_backup() on master 
master# sudo -u postgres psql -c "SELECT pg_start_backup('`date +'%Y-%m-%d %H:%M'`')"

2. remove data files on standby 
standby# /bin/rm -rf /opt/postgresql/10/main/* 

3. copy data files 
master# rsync -e 'ssh -p 22960' -av --delete /opt/postgresql/10/main pro-wms-db-g02:/opt/postgresql/10/ --exclude=pg_xlog --exclude=postmaster.pid  --exclude=server.crt --exclude=server.key 

4. invoke pg_stop_backup() on master 
master# sudo -u postgres psql -c "SELECT pg_stop_backup()" 

5. start postgresql on standby 
standby# systemctl start postgresql

Reference links:


“It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you’ll do things differently”
— Warren Buffett

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s