Some time ago I had to install and setup a PostgreSQL cluster for a very critical application, an Warehouse Management System (WMS) developed internally for an ecommerce startup in which I worked.
I would like to share the steps I followed back in those days to setup the proper configuration, I have to say that I was a lucky guy because the Dev Lead who managed the Dev Team in this case he had already worked with PostgreSQL in the past, so our meetings to define Dev Team requirements were pretty easy.
Some context
From our meetings, I had to understand what were the specific requirements needed for such a specific peace of software as a WMS is, in a regular situation this software is provided by an external Warehouse company, but in this case the relationship with this startup was so good with Warehouse company that they reached an agreement to include a very custom WMS inside the Warehouse provider infrastructure, which it’s not very common situation.
Apart from that the requirements for the PostgreSQL cluster were pretty clear:
- Not expecting a high load we only needed two nodes, one master node and one replica in standby mode
- Regarding replication strategy we decided to use an Asynchronous streaming replication
Setting up the MASTER
Create archive directory:
db01:/etc/postgresql/10/main# mkdir /opt/postgresql/10/main/archive
db01:/etc/postgresql/10/main# chown postgres:postgres /opt/postgresql/10/main/archive
Edit postgresql.conf configuration file:
...
listen_addresses = '*' # what IP address(es) to listen on;
wal_level = hot_standby # minimal, archive, hot_standby, or logical
archive_mode = on # enables archiving; off, on, or always
archive_command = 'cp %p /opt/postgresql/10/main/archive/%f'
max_wal_senders = 5 # max number of walsender processes
wal_keep_segments = 32 # in logfile segments, 16MB each; 0 disables
hot_standby = on # "on" allows queries during recovery
...
Setting up the REPLICA
The configuration file postgresql.conf was the same as on the master node in our case. It was needed to create an archive directory:
db02:/etc/postgresql/10/main# mkdir /opt/postgresql/10/main/archive
db02:/etc/postgresql/10/main# chown postgres:postgres /opt/postgresql/10/main/archive
It was also required to create a recovery.conf file with this content in our DATA DIRECTORY:
# This tells the slave to keep pulling WALs from master.
standby_mode = on
# This is how to connect to the master.
primary_conninfo = 'host=10.240.0.XX user=repl password=YYYYY'
trigger_file = '/opt/postgresql/10/main/archive/failover_db_trigger'
restore_command = 'cp /opt/postgresql/10/main/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /opt/postgresql/10/main/archive %r'
As an additional setup I included an immutability check for this configuration file:
db02:/opt/postgresql/10/main# chattr +i recovery.conf
Verification
To verify that everything was working from replication perspective we made some tests:
MASTER
postgres=# create database foo;
CREATE DATABASE
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
foo | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(4 rows)
STANDBY SERVER
postgres=# create database foo;
ERROR: cannot execute CREATE DATABASE in a read-only transaction
To verify replication status:
# sudo -u postgres psql -c "select * from pg_stat_replication";
Troubleshooting
If you face some issues as we did something similar to this:
2018-05-22 09:34:54.839 UTC [9724] LOG: record with incorrect prev-link 0/70013E8 at 0/7001420
cp: cannot stat '/opt/postgresql/10/main/archive/000000010000000000000007': No such file or directory
2018-05-22 09:34:59.844 UTC [9724] LOG: record with incorrect prev-link 0/70013E8 at 0/7001420
cp: cannot stat '/opt/postgresql/10/main/archive/000000010000000000000007': No such file or directory
2018-05-22 09:35:04.851 UTC [9724] LOG: record with incorrect prev-link 0/70013E8 at 0/7001420
We followed these steps to fix the issue (note 22960 port was a custom port where SSH server was running):
0. Stop postgresql on standby
standby# systemctl stop postgresql
1. invoke pg_start_backup() on master
master# sudo -u postgres psql -c "SELECT pg_start_backup('`date +'%Y-%m-%d %H:%M'`')"
2. remove data files on standby
standby# /bin/rm -rf /opt/postgresql/10/main/*
3. copy data files
master# rsync -e 'ssh -p 22960' -av --delete /opt/postgresql/10/main pro-wms-db-g02:/opt/postgresql/10/ --exclude=pg_xlog --exclude=postmaster.pid --exclude=server.crt --exclude=server.key
4. invoke pg_stop_backup() on master
master# sudo -u postgres psql -c "SELECT pg_stop_backup()"
5. start postgresql on standby
standby# systemctl start postgresql
Reference links:
- https://wiki-bsse.ethz.ch/display/ITDOC/Set+up+PostgreSQL+asynchronous+streaming+replication
- https://www.postgresql.org/docs/10/static/warm-standby.html#STANDBY-SERVER-OPERATION
- https://www.postgresql.org/docs/10/warm-standby.html#SYNCHRONOUS-REPLICATION
- https://www.enterprisedb.com/blog/cheat-sheet-configuring-streaming-synchronous-replication-postgresql
—
“It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you’ll do things differently”
— Warren Buffett