Bidirectional synchronization between two Postgres databases

During my research, I analyzed notes from 7 different Reddit posts discussing various aspects of database synchronization, with a particular focus on PostgreSQL. The sources range from bidirectional replication to copying table data between different databases, and they offer a variety of recommendations and tools for these tasks. The consensus across the sources is not uniform, as different users have different preferences and experiences. My analysis summarizes the most relevant and helpful suggestions found in these sources.

PgBackRest and WAL Archiving

pgBackRest is a third-party tool that offers asynchronous archiving, significantly improving the throughput of archiving and replaying WAL segments in PostgreSQL compared to the built-in archive_command system. Users have shared positive experiences with pgBackRest, highlighting its efficiency and recommending it for large datasets. For fast and efficient bidirectional synchronization, the general suggestion is to learn and implement third-party solutions like pgBackRest rather than rely solely on built-in tools.
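
As a rough illustration of how this looks in practice (not configuration taken from the sources), asynchronous archiving is enabled in pgbackrest.conf and wired into PostgreSQL through archive_command; the stanza name, paths, and process count below are placeholders:

```bash
# Hypothetical sketch: enable pgBackRest asynchronous archiving.
# Stanza name, repository path, and process count are placeholders.

# /etc/pgbackrest/pgbackrest.conf (excerpt):
#   [global]
#   repo1-path=/var/lib/pgbackrest
#   archive-async=y                 # offload archiving to a background process
#   spool-path=/var/spool/pgbackrest
#   process-max=4                   # parallel archive workers
#
#   [main]
#   pg1-path=/var/lib/postgresql/15/main

# postgresql.conf on the primary:
#   archive_mode = on
#   archive_command = 'pgbackrest --stanza=main archive-push %p'

# Create the stanza and verify the configuration end to end.
pgbackrest --stanza=main stanza-create
pgbackrest --stanza=main check
```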

DBeaver

DBeaver is another third-party tool that can automate the process of copying table data between different databases. It allows users to connect to both the source and destination databases, select the tables to be copied, configure copy options, and start the copy job. DBeaver can handle various copy modes, identity columns, constraints, and NULL/default values, making it a versatile option for database synchronization.

Using pg_dump and pg_restore

Some users recommend using pg_dump and pg_restore with the -t option to dump and restore only the selected tables between databases. This approach allows for more granular control over the data being transferred and can work well for specific use cases.
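
A minimal sketch of this approach, assuming a placeholder table named orders and databases named sourcedb and destdb:

```bash
# Hypothetical sketch: dump a single table from the source database in
# custom format, then restore just that table into the destination.
pg_dump -Fc -t orders -d sourcedb -f orders.dump

# Add --data-only if the table already exists in the destination.
pg_restore -d destdb orders.dump
```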

SQL Statements and Data-Only Dumps

One suggestion for copying table data between different databases is to create a data-only dump as plain SQL statements and then execute it in the destination database. This approach is relatively straightforward but may not be suitable for all scenarios, especially when dealing with large datasets or many tables.
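
A rough sketch of this workflow, assuming placeholder database names sourcedb and destdb:

```bash
# Hypothetical sketch: dump data as plain SQL INSERT statements (no schema),
# then replay the file against the destination database.
pg_dump --data-only --column-inserts -d sourcedb -f data.sql

# Replay the generated statements; the target tables must already exist.
psql -d destdb -f data.sql
```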

Using INSERT INTO and SELECT Statements

A user on Reddit suggests using `INSERT INTO table2 SELECT * FROM table1` to copy data between tables in different databases, but notes that this method may not work if the source and destination tables have different schemas or constraints. The method is relatively simple and can be useful for small-scale data copying tasks; a cross-database variant is sketched below.
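
Since a plain INSERT ... SELECT cannot reach across two separate servers on its own, one common workaround is to pipe COPY output from one database into the other; alternatively, an extension such as postgres_fdw or dblink can expose the remote table so the INSERT works directly. A minimal sketch of the COPY pipe, with placeholder database and table names:

```bash
# Hypothetical sketch: stream rows out of table1 in the source database and
# straight into table2 in the destination. Column layouts must be compatible.
psql -d sourcedb -c "COPY table1 TO STDOUT" \
  | psql -d destdb -c "COPY table2 FROM STDIN"
```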

Conclusion

There are various methods and tools available for synchronizing data between PostgreSQL databases. Depending on your specific use case and requirements, you may prefer third-party tools like pgBackRest or DBeaver, or rely on built-in PostgreSQL tools and SQL statements. It is essential to evaluate each option against your needs and the complexity of the data being synchronized.

Research

"ELI5: What is a database? And what is SQL language used for? Why are some comercial databases(oracle) really expensive and some are free?"

  • A database stores data in tables.
  • SQL is used to talk to the database.
  • SQL is used to pull data you need from database.
  • There are many free solutions to build and query databases.
  • Oracle and MS SQL Server offer extras like auto-backup, query efficiency analysis, automatic rollover, and database syncing, which come at additional cost.
  • The cost grows as requirements increase and these advanced features are added.
  • Flat files are the simplest and earliest form of data storage: a single data table holding all the separate pieces of data.
  • Flat files must be loaded entirely to retrieve data; it is difficult to share them across a network and difficult to prevent malicious activity over the network.
  • A database differs from a flat file in that it is its own program and provides its own interface for accessing data.
  • SQL provides a human-readable way to build complicated data-retrieval rules.
  • A database organizes data to reduce redundancy, using multiple tables to store related data and creating relations between objects and entities.
  • Databases also offer security measures to keep data secure and manage who can access the database.
  • Expensive enterprise-level databases take security measures to extremes and offer frequent, professional product support, capabilities to manage and analyze large databases, compatibility with other enterprise software, and timely security patches.
  • A database can be considered a filing cabinet, and different databases can connect with one another, making it a series of filing cabinets.
  • SQL can navigate databases.
  • Multiple databases can store and share massive amounts of information.
  • A text file with all the names and phone numbers of friends is a database. It could have columns, labels, metadata, or may be lightly structured.

"How do I keep my local database and remote database in sync?"

  • The Reddit post was published 7 years ago and received 20 points
  • The post discusses how to keep a local database and remote database in sync for WordPress projects using WAMP server and Bitbucket for source control
  • One user suggests looking into migrations via Phinx for maintaining data and schema updates
  • Another user notes that data is often the focus for WordPress, and suggests using the Duplicator plugin to clone the whole site to keep everything in sync
  • One user suggests using roundhouse for syncing the database schema, but also notes that they are unsure of how well the tool works for WordPress
  • A user suggests setting up a replication between the databases, adding that while it is complicated and comes with some overhead, it is a neat thing to know about
  • Another user suggests creating a remote database and connecting to it through cPanel, although speeds may be slower when compared to working locally
  • A user notes that they use DBV project, a MySQL Database Version Control System, for syncing changes to custom MySQL projects. However, they also note that this solution may not work for WordPress since it cannot detect changes to the database
  • The DBV project saves each change as a set of individual or combined SQL statements and stores the statements in DBV as a series of revisions, with each revision stored in a different numerically sequential file
  • The same user notes that DBV works well with Git, as it detects changes to the database table and allows users to run SQL statements
  • The focus of the discussion is mostly on MySQL and WordPress, and there is no specific mention of Postgres. However, the topic of syncing databases may be relevant to other database management systems as well.

"DataGrip: Copying table data between different databases into existing tables"

  • User on Reddit is attempting to copy over table data from one DB to another.
  • The user wants to insert data directly into any existing tables on the receiving end, without creating new tables.
  • When the user drags and drops tables from the source to destination database, the modal window that appears defaults to a new table name value, for example, table_1.
  • The user can change the name of the destination table in the modal window to accomplish what they are looking for, but with hundreds of tables in scope, this is not ideal.
  • Someone on Reddit suggests creating a db dump as SQL statements with “data only” and then executing it in the destination DB.
  • Another suggestion is to use pg_dump and pg_restore with the -t option to dump and restore only the selected tables.
  • It is recommended to use a third-party tool to automate the copy process if there are many tables to copy.
  • Dbeaver is suggested as a tool that can automate the copy process.
  • Steps to copy the data using Dbeaver are:
    • Connect to both source and destination databases.
    • Select the table(s) to be copied from the source database.
    • Right-click on the selection and choose “Copy Data to Another Database…”
    • Choose the destination database and configure the copy options as needed.
    • Verify the summary of the copy process and start the copy job.
  • Some of the copy options that can be configured in Dbeaver include:
    • Copy mode (Insert or Insert/Update)
    • Handling of identity columns and constraints
    • Handling of NULL and default values
  • Dbeaver can also preview the data to be copied before starting the copy job.
  • Someone on Reddit suggests using INSERT INTO table2 SELECT * FROM table1 to copy data between tables in different databases, but notes that this method may not work if the databases are on different servers or have different table definitions.

"Synchronizing schema with Redis?"

  • Someone asked for options to copy PostgreSQL tables to Redis, and keep their schemas in sync whenever a change occurs in PostgreSQL.
  • Rspamd (an open source spam filtering system) was mentioned in the context of keeping mail lookup tables in PostgreSQL in sync with Redis for determining handling preferences.
  • The initial suggestion is to use Apache Storm to achieve the synchronization, but the user is looking for something else.
  • Three main solutions are proposed:
    1. Use a foreign data wrapper: A foreign data wrapper makes a Redis table available in PostgreSQL, and triggers can be used on PostgreSQL tables to write changes to Redis tables. An extension available for Redis is pg_redis_fdw which provides the functionality.
      • A @reddit user proposed this suggestion, with 2 karma points. Another user responded that this is their Plan B because the requirement is for something that will arbitrate data between Redis (read-only) and PostgreSQL (read-write) and monitor PostgreSQL for changes in selected tables, almost like replication, but for the data rather than the queries that result in the data.
    2. Use Notify/Listen to push events to Redis.
      • One @reddit user suggested this approach, with 1 karma point. Writing a subscriber that pushes events to Redis is possible; however, if the subscriber goes down, the data can become desynchronized. A proposed workaround is to reload the data when the subscriber comes back online. Another suggestion is to add a trigger on INSERT/UPDATE/DELETE to publish changes (see the sketch after this list).
    3. Use logical replication: Decoding the replication stream can be achieved through various libraries, and the stream restarts from where the subscriber last disconnected, but this may not be viable since PostgreSQL versions back to 9.2 have to be supported.
      • @reddit user proposed this approach with 1 karma point. The suggestion is followed by another @reddit user who mentioned that depending on the library used, Listen is supported and keeping it simple should be easy.
  • The user asking for help confirms that the notify/listen approach seems to be the way to go, and might look to keep it simple by implementing the solution themselves.
  • The thread discussing possible solutions to keep PostgreSQL tables synchronized with Redis took place on the r/PostgreSQL subreddit four years ago. The thread has 3 points, and the title is “Synchronizing schema with Redis?”.
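
As a rough illustration of the trigger-plus-NOTIFY idea (not code from the thread), the sketch below publishes row changes on a channel that an external subscriber process would LISTEN on and relay into Redis; the database, table, and channel names are placeholders:

```bash
# Hypothetical sketch: publish row changes from PostgreSQL via NOTIFY so a
# separate subscriber can mirror them into Redis. Names are placeholders.
psql -d sourcedb <<'SQL'
CREATE OR REPLACE FUNCTION notify_lookup_change() RETURNS trigger AS $$
BEGIN
  -- Send the table name and operation on the "lookup_changes" channel.
  PERFORM pg_notify('lookup_changes', TG_TABLE_NAME || ':' || TG_OP);
  RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lookup_change_notify
AFTER INSERT OR UPDATE OR DELETE ON mail_lookup
FOR EACH ROW EXECUTE PROCEDURE notify_lookup_change();
SQL

# A subscriber would then run LISTEN lookup_changes; push each event into
# Redis, and reload the full table whenever it reconnects after downtime.
```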

"Strategies have you used to sync a PostgreSQL database with Redshift?"

  • The original post on Reddit is asking for strategies to sync data between PostgreSQL and Redshift, both of which are hosted on AWS.
  • The person who posted recommends using Presto/Trino as it has a built-in connector that solved their syncing issues, but they also heard AWS DMS might be a solution.
  • There is a Reddit response highlighting various vendor/solutions to consider.
    • FiveTran is regarded as a leader right now in ELT, and the comment suggests using dbt alongside it for transformation.
    • Matillion is a big competitor of FiveTran that offers transformation and data model management layer as part of the product.
    • Stitchdata focuses on EL kind of solution, currently owned by Talend.
    • Airbyte is a new company with good funding and is an open-source solution.
    • Pipelinewise is based on Singer, supported by Transferwise, and is another open-source solution.
    • Rudderstack can offer all the different types of pipelines someone might need under one platform.
    • Meroxa provides CDC as a service.
    • Debezium is a CDC framework one can implement.
    • AWS has multiple products to consider, such as Glue, DMS, and Kafka + Kafka Connect.
  • Other Reddit users provided their opinion on some of the solutions.
    • One user plans to use DMS but warns that it has issues such as data type conversions, transient errors, and more.
  • Some users suggest that Redshift can read directly from Postgres through dblink (a minimal sketch follows this list), while another user says that external schemas on Redshift might be the solution OP is looking for.
  • There is a discussion on different factors to consider when choosing a solution, such as latency, deletions, cloud solutions, destinations, vendor lock-in, privacy, and more.
  • One user recommends avoiding building something in-house and suggests trying out available solutions instead.
  • The post is one year and seven months old as of writing and has 14 points. There are 15 comments, and some of the comments contain links. One comment includes a disclaimer noting that the commenter works for Rudderstack.
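
As a generic illustration of the dblink idea mentioned above (whether it runs from the RDS or the Redshift side depends on the setup, and the connection details, table, and columns below are placeholders):

```bash
# Hypothetical sketch: query a remote Postgres-protocol endpoint through the
# dblink extension. Host, credentials, and column list are placeholders.
psql -d localdb <<'SQL'
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT *
FROM dblink('host=remote-host dbname=remotedb user=sync_user password=secret',
            'SELECT id, updated_at FROM events')
     AS t(id bigint, updated_at timestamptz);
SQL
```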

"Easy and cost effective way to sync data from RDS Postgres to Redshift"

  • A startup wants to sync data between RDS Postgres and Redshift.
  • They have a master RDS Postgres database of approximately 500 GB that grows by 1-2 GB/day.
  • They are currently using a Read Replica of Postgres as their data warehouse but are experiencing slow performance due to the increase in data size.
  • They’ve explored different options for syncing data, including using DMS CDC to replicate data in real-time to Redshift, but replication stopped at times due to DDL statements and the high volume of traffic causes Redshift queries to abort.
  • They also explored using paid solutions like Fivetran and Stitchdata. These solutions could be more reliable than DMS CDC and support all DDL operations, but data privacy and cost are a concern.
  • Exporting RDS snapshots to S3 and importing them into Redshift daily is another option, but it’s inefficient as the entire database needs to be imported regularly, and data transfer costs could be high.
  • Using Kinesis Data Firehose to load data into Redshift seems like a good option, as it can do CDC and load data efficiently because Firehose stages data in S3 first and then uses the COPY command to ingest it into Redshift (a sketch of the COPY step follows this list). However, the author is unsure how upsert would work in this case.
  • The ideal solution for the startup is cost-efficient, robust, requires minimal maintenance, and has minimal lag in syncing. They are okay with a 12-24 hour lag and don’t necessarily need CDC.
  • Some Reddit users suggest using AWS Glue to load data from Postgres to Redshift instead of DMS CDC, while others suggest using Redshift Spectrum to query S3 directly.
  • Another suggestion is using PostgreSQL for analytics queries with a data volume of up to 5-10 TB and using partitioning and indexing strategies to increase performance.
  • Some users suggest Glue crawlers for setting up external tables in Redshift and using foreign data wrappers in Redshift to connect to RDS.
  • One user recommends using Snowflake or BigQuery as they are usually cheaper and easier for an analyst to do most of the work on.
  • Multiple users warn that operating Redshift can be challenging and requires a full-time data engineer.
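
For reference, the Redshift ingestion step mentioned in this thread generally comes down to a COPY from S3; the cluster endpoint, table, bucket path, and IAM role below are placeholders rather than values from the discussion:

```bash
# Hypothetical sketch: load staged files from S3 into a Redshift table.
# Cluster endpoint, table, bucket path, and IAM role ARN are placeholders.
psql -h my-cluster.example.redshift.amazonaws.com -p 5439 -d analytics <<'SQL'
COPY public.events
FROM 's3://my-staging-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 'auto';
SQL
```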

"WAL archiving and replay are unacceptably slow without 3rd-party tools like pgBackRest, isn't it?"

  • Archiving and replaying WAL (Write-Ahead-Log) segments in PostgreSQL can be slow and unreliable without the use of third-party tools like pgBackRest.
  • The archive_command system can be clunky and rerun inefficiently for every WAL segment.
  • pgBackRest offers asynchronous archiving, which offloads archiving to a separate process to improve throughput. Users report a more than 10x throughput difference compared to the archive_command system, especially with parallelism.
  • Users discuss their experience backing up and restoring large datasets with built-in tools and third-party solutions, noting that ignoring the better third-party tools is a poor choice.
  • Streaming replication is great, but archive_command still has to be used for backup/PITR purposes.
  • It is possible to bootstrap a replica without archive_command by running pg_basebackup against a primary that has no archive_command configured (see the sketch after this list).
  • There were talks about making archive_command run in parallel.
  • Users mentioned potential issues with streaming replication due to losing WAL segments between backup and the start of streaming replication on a new standby.
  • pg_receivewal is another tool that helps address WAL archiving and replay, but it is not as widely used as pgBackRest.
  • There are risks if WAL files on the standby node are not kept up to date and in sync, which is needed for bidirectional replication with Pgpool-II.
  • Users suggest that fast and efficient bidirectional synchronization may require learning and implementing third-party solutions rather than relying on built-in Pgpool-II, in order to handle high-volume throughput and connections and to avoid delays and potential database freezes.
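
As a rough illustration of the pg_basebackup bootstrap mentioned above (host, user, and data directory are placeholders):

```bash
# Hypothetical sketch: seed a new standby directly from the primary over the
# replication protocol, with no archive_command configured on the primary.
pg_basebackup \
  --host=primary.example.internal \
  --username=replicator \
  --pgdata=/var/lib/postgresql/15/standby \
  --wal-method=stream \
  --write-recovery-conf \
  --progress
# --wal-method=stream streams the WAL needed during the base backup;
# --write-recovery-conf writes the standby's connection settings for it.
```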
