Article 6: Data Replication Series Part 1: Key Challenges

Tags: Data Integration, Data Replication
Published June 15, 2024

Data replication remains a perennially relevant topic, whether for ensuring high availability of core IT systems, replacing database systems with domestic alternatives, or integrating big data platforms. Through extensive client implementations, Shuyi Technology has established data replication and integration as two of our most critical products and services. In this series, Xiao Liu explores the subject. This article begins by defining five areas of fundamental challenges in data replication, each discussed in detail with industry insights and Shuyi's practical implementation experience.

Note: this series is not about synchronization between replicas within one database instance, but about data replication between two different instances, or even between data management systems with different architectures.

Fundamental issues

1. Offline or real-time: Whether the cat is black or white, one that catches mice is a good cat

Offline and real-time are relative concepts: there is no absolute offline, and no absolute real-time.

Here we define offline and real-time replication from the perspective of the replication target and the business requirements. Data is considered offline when its freshness at the target end exceeds a certain threshold, and real-time otherwise. For example, in a business system that must see the latest source data within 5 seconds, the threshold between offline and real-time replication is 5 seconds. If a migration tool claims real-time replication but its technical limitations push target-end latency to 10 seconds, it does not qualify as real-time. Conversely, if a system does not rely on real-time replication technology (such as transaction logs) but uses specific markers to keep batch replication delays under 5 seconds, it still meets the business's real-time requirement.

Shuyi's Liquid software supports multiple replication modes: offline batch, near real-time, and real-time replication. These three modes are defined from a technical perspective, and the more real-time the architecture, the more complex it becomes. Our advice to clients is not to fixate on the technology, but to choose the mode that meets the business's data-freshness requirements.

2. One-way replication and two-way replication: sounds great, but it's a hassle

In unidirectional replication, data flows solely from the source to the target; this is the predominant scenario. In bidirectional replication, two systems serve as each other's source and target. Bidirectional replication is typically used in active-active environments, for example when two database systems are deployed as peers across two Availability Zones (AZs) and applications write in both AZs. Compared with unidirectional replication, it introduces two additional challenges:

2.1 Circular replication

System A inserts a record R1 and generates log entry RLog1. RLog1 is replicated to system B, which replays it; the replay produces B's own log entry for R1, which is then replicated back to system A and replayed, and the cycle repeats. There are two ways to break this loop:

1) Log tagging: Add a tag to log entries so that entries produced by replaying replicated data can be distinguished from locally generated ones; the system replaying a tagged entry will not forward it back to its peer. This suits users who control the database kernel, such as PostgreSQL 16 and HexaDB 3.x.

2) Data marking: Insert an additional write operation at the beginning of each replayed transaction (this need not modify business data, as long as it appears in the system log) to mark the transaction as replicated rather than locally generated. The capture process recognizes this marker and skips the transaction. This suits scenarios where the application or replication pipeline can be modified accordingly; a minimal sketch follows.
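Below is a minimal sketch of the data-marking approach, assuming a Python DB-API connection (sqlite3-style placeholders) and a helper table `_repl_origin` that exists on both systems; all names are illustrative, not part of any real product.

```python
# Hypothetical data-marking sketch: the marker row is the first write of every
# replayed transaction, so the capture side can recognize and drop the loop.

def replay_transaction(conn, statements, origin_id):
    """Replay a captured transaction atomically, prefixed with a marker write."""
    with conn:  # one transaction on the target: marker + business writes
        cur = conn.cursor()
        cur.execute("INSERT INTO _repl_origin (origin) VALUES (?)", (origin_id,))
        for sql, params in statements:
            cur.execute(sql, params)

def should_forward(transaction_ops):
    """On the capture side: skip transactions that begin with a marker write."""
    return transaction_ops[0]["table"] != "_repl_origin"
```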

2.2 Data conflicts

1) The most effective conflict resolution strategy is prevention. When systems A and B share a data flow, partition updates by ID: ensure ID1 is only modified in system A while ID2 remains exclusive to system B, and strictly avoid simultaneous writes to the same ID in both systems. For auto-increment IDs, configure distinct increment rules for each system (for example, odd IDs on A, even IDs on B). Prevention should cover at least 99% of cases; otherwise, resolving conflicts through manual intervention will consume so much time that you may lose the chance to watch a movie with friends.

2) When conflicts are unavoidable, there are at least two solutions. First, if each data row carries a timestamp accurate to the millisecond (workable when TPS <= 1000), let the later write prevail during replay; a minimal sketch of this option follows. Second, implement a territorial-dispute mechanism in the spirit of "setting aside disputes for joint development": record the conflict details, have business colleagues determine the winner, and then go see that movie with friends.
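A minimal last-writer-wins sketch of the first option, assuming each row carries a millisecond-precision `updated_at` value; the column name and row shape are illustrative.

```python
# Last-writer-wins: keep whichever version of the row was written later.

def resolve_conflict(local_row: dict, incoming_row: dict) -> dict:
    if incoming_row["updated_at"] > local_row["updated_at"]:
        return incoming_row  # the replicated change is newer and wins
    # local change wins; the skipped update could be logged for audit
    return local_row
```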

3. Physical replication versus logical replication: The general versus the specific

Physical replication copies physical storage blocks, or the physical changes to storage blocks. Logical replication copies the higher-level abstract operations, typically SQL statements or plaintext data.

Physical replication is generally only suitable between strictly isomorphic systems (even different versions of the same system impose strict requirements). Logical replication applies much more broadly, because SQL follows standard specifications and plaintext data can undergo a further conversion step.

Most data replication between heterogeneous systems uses logical replication. HexaDB supports both modes; physical replication is generally reserved for disaster recovery between homogeneous systems.

4. Heterogeneous type conversion: spreads to subsequent stages like a virus

As with data conflicts, prevention beats cure: the best way to handle inconsistent types is to make them consistent or compatible so that no conversion is needed. When two systems have genuinely different data types, we generally use two methods (both sketched after the list):

a) Equivalent conversion. For example, storing values in character form, which loses nothing.

b) Truncate and discard. For example, numeric precision is dropped or strings are truncated; exceptions can be recorded for later troubleshooting.
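A hedged sketch of the two methods, assuming a high-precision source decimal landing in a lower-precision target column; the scale and logging scheme are assumptions for illustration.

```python
from decimal import Decimal, ROUND_DOWN

exceptions_log = []  # rows to revisit during troubleshooting

def convert_equivalent(value: Decimal) -> str:
    # a) equivalent conversion: store as text, nothing is lost
    return str(value)

def convert_truncating(value: Decimal, scale: int = 4) -> Decimal:
    # b) truncate: drop digits beyond the target scale and record the loss
    truncated = value.quantize(Decimal(10) ** -scale, rounding=ROUND_DOWN)
    if truncated != value:
        exceptions_log.append((value, truncated))
    return truncated

print(convert_truncating(Decimal("3.14159265")))  # Decimal('3.1415')
```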

5. Schema Change or No Schema Change: The cost of changing

When replicating structured data, we inevitably encounter changes to the data structure. Since replication is typically asynchronous (synchronous replication is generally reserved for HA between data replicas), any structural modification to a table, such as adding, dropping, or modifying columns, may render the recorded structure description obsolete and cause errors when changes are replayed on the target. We employ two strategies to address this:

a) Use a built-in metadata registrar: register and manage metadata versions according to the DDL statements in the log, and fetch the structure description matching the metadata version at replay time.

b) Adopt a third-party metadata management system to track versions of the database metadata, and fetch the structure description matching the metadata version at replay time.

The choice between the two strategies depends on the business system. Shuyi's Liquid software currently adopts the first strategy to support DDL.
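An illustrative toy version of strategy a): a tiny in-process registry that versions table schemas as DDL appears in the log. This is an assumption about the shape of such a component, not Liquid's actual implementation.

```python
class MetadataRegistry:
    def __init__(self):
        # table name -> list of (log_position, column_list), appended in log order
        self.versions = {}

    def register_ddl(self, table, log_pos, columns):
        """Record the schema that takes effect at this log position."""
        self.versions.setdefault(table, []).append((log_pos, list(columns)))

    def schema_at(self, table, log_pos):
        """Return the schema in effect for a change captured at log_pos."""
        current = None
        for pos, columns in self.versions.get(table, []):
            if pos <= log_pos:
                current = columns
            else:
                break
        return current

reg = MetadataRegistry()
reg.register_ddl("orders", 0, ["id", "amount"])
reg.register_ddl("orders", 500, ["id", "amount", "currency"])  # ALTER TABLE ADD
print(reg.schema_at("orders", 300))  # ['id', 'amount']
```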

6. No-Schema to Schema: A bridge between two worlds

No-Schema generally refers to non-relational data, while Schema refers to relational data. In practice we often need to copy non-relational data into relational form, such as replicating key-value data from LevelDB or documents from MongoDB into HexaDB-A, a relational database.

The simplest and most desirable case is when the No-Schema data keeps a fixed format, allowing a straightforward mapping. Reality is often harsher: K-V formats are unpredictable and documents vary widely. This is where an intelligent Schema Detector comes into play, performing the following tasks (a toy detector is sketched after the list):

a) Scan the source data with detection algorithms, identify the schemas present, and merge them into a flat union of all schemas: a wide table that tolerates every variation. Alternatively, create hierarchical tables that mirror the nested structure (hierarchical processing).

b) If detection "fails", fall back to remedial measures such as recording exceptions or altering the structure of the target table.
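A toy detector for step a), under the simplifying assumption that documents are flat dicts; real detectors must also handle nesting and type conflicts.

```python
def detect_wide_schema(documents):
    """Merge the fields observed across documents into one wide-table schema."""
    schema = {}  # column name -> set of observed value types
    for doc in documents:
        for key, value in doc.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

docs = [{"id": 1, "name": "a"}, {"id": 2, "tags": ["x"]}]
print(detect_wide_schema(docs))
# {'id': {'int'}, 'name': {'str'}, 'tags': {'list'}}
```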

7. Database or non-database: more like stamp collecting

The human world is vibrant and colorful, and the digital realm likewise teems with diverse data. Beyond relational databases, K-V systems, and document stores, we encounter sources such as FTP, Syslog, plain text files, WebSockets, web pages, and message queues, each requiring careful handling. Different protocols and formats call for distinct replication strategies, which we will explore in detail later.

Integrity and reliability

1. Integrity: Adhere to Once and Only Once

The fundamental principle of integrity is that each piece of data from the source is replayed exactly once at the target. The need for exactly-once replay is self-evident. In systems with primary-key or unique-key constraints, it can be enforced through those constraints; for other data structures, duplicate records may occur, and detecting them proves challenging.

To ensure once and only once, the industry's common practice is to record a position (checkpoint). Positions take many forms: a sequence number in a log or message, or a specific column in a data row. Having found this position, how do we use it?

a) If the target has usable constraints, we can record positions asynchronously. After fault recovery, the constraints eliminate duplicate data for us, though this may lengthen recovery time.

b) If no constraints are available, we must resort to "assembled transactions": create a separate position table recording consumed positions, and bundle the position update with the business writes into a single transaction. The atomicity of the target database's transactions then guarantees that position data and business data stay consistent.
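A sketch of approach b), assuming a Python DB-API connection (sqlite3-style placeholders) and a `_checkpoint` table; the table and column names are illustrative.

```python
def apply_batch(conn, rows, position):
    """Write business rows and the consumed position in one atomic transaction."""
    with conn:  # either both the rows and the checkpoint land, or neither does
        cur = conn.cursor()
        cur.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        cur.execute("UPDATE _checkpoint SET pos = ? WHERE consumer = 'orders'",
                    (position,))

def resume_position(conn):
    """After a crash, restart reading the source from the recorded position."""
    cur = conn.cursor()
    cur.execute("SELECT pos FROM _checkpoint WHERE consumer = 'orders'")
    return cur.fetchone()[0]
```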

If you have no time to implement either approach, we offer another option: Luban's duplicate-data detection and processing. Luban helps users find duplicate or suspected-duplicate rows in massive datasets, with multiple parameters to trade scan time against accuracy.

2. N ways to check integrity

There are several dimensions to evaluate whether data has been migrated completely:

a) Row count. Count is a coarse-grained metric but remains a reliable basis for comparison. For ultra-large datasets, however, products implement exact counting differently, so the cost of obtaining precise row counts varies. Most systems also provide row-count estimation, which suits scenarios where high accuracy is not critical.

b) Content verification employs two approaches: exact matching and sampling. Exact matching traverses data rows and compares content, for example by hashing all columns or selected key columns (incompatible data types need special handling). Where accuracy requirements are looser, row sampling can be used, with the sampling ratio tuned to the desired precision.
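A sketch of exact matching via row hashes; normalizing every column to text is a simplifying assumption, and real tools must handle type incompatibilities more carefully.

```python
import hashlib

def row_fingerprint(row):
    """Hash a row's columns into a stable fingerprint."""
    joined = "\x1f".join(str(col) for col in row)  # unit separator between columns
    return hashlib.sha256(joined.encode()).hexdigest()

def diff_tables(source_rows, target_rows):
    """Return (missing on target, extra on target) as sets of fingerprints."""
    src = {row_fingerprint(r) for r in source_rows}
    tgt = {row_fingerprint(r) for r in target_rows}
    return src - tgt, tgt - src
```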

Both approaches above only work for static data. For dynamically growing data we need a dynamic comparison method. One feasible approach is to build a Merkle tree, which may require an additional server to run the comparison service. Liquid currently takes this approach, supporting dynamic retrieval of integrity-comparison results between two data sets and identification of inconsistent rows.
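A toy Merkle-tree construction over row hashes to illustrate the idea: both sides build the tree the same way, compare root hashes, and descend only into subtrees that differ. The construction details here are illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(row_hashes):
    """Fold leaf hashes pairwise up to a single root hash."""
    level = list(row_hashes)
    while len(level) > 1:
        if len(level) % 2:                 # odd count: duplicate the last node
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

rows = [h(f"row{i}".encode()) for i in range(8)]
assert merkle_root(rows) == merkle_root(list(rows))  # identical data sets match
```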

3. Should the migration tool itself be highly available?

If the data you capture is transient and critical to you, consider making the replication tool itself highly available. Otherwise, you can simplify away the HA architecture and rely on resumable restarts or a full redo.

Performance

1. Source-side and target-side acceleration

A replication pipeline has three stages: source capture, intermediate buffering, and target replay. Optimization effort concentrates on source capture and target replay. There are two main ways to improve performance:

a) Increase parallelism

Source-side parallelism can operate across multiple tables or within a single table. Multi-table parallelism is straightforward and is mainly used for replicating existing data: multiple tasks simultaneously pull data from different source tables. Single-table parallelism partitions the table on primary or unique keys and processes the partitions with multiple tasks. When the source is actively serving business traffic, consistency for new data must be preserved by reading from a snapshot and recording the snapshot checkpoint; once the existing data has been retrieved, incremental replication continues from that checkpoint. A range-partitioning sketch follows.
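A sketch of single-table parallelism via primary-key ranges; the table, column, and chunking policy are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def key_ranges(min_id, max_id, workers=4):
    """Split [min_id, max_id] into roughly equal half-open ranges."""
    step = (max_id - min_id + workers) // workers
    return [(lo, min(lo + step, max_id + 1))
            for lo in range(min_id, max_id + 1, step)]

def read_chunk(low, high):
    # a real worker would execute:
    # SELECT * FROM t WHERE id >= :low AND id < :high
    return f"read rows {low}..{high - 1}"

with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(lambda r: read_chunk(*r), key_ranges(1, 1_000_000)):
        print(result)
```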

b) SQL or SQL bypass

This approach mainly targets the target end. Whereas the discussion so far replays data through the SQL layer (statements, COPY operations, and so on), this technique bypasses the SQL layer (parsing, plan generation, and execution) and writes database files directly. It typically delivers exceptional performance gains but restricts concurrent read/write operations on the target.

2. Build a Chinese highway or a German highway

a) Speed limit or no speed limit

We often assume that replicating existing data as fast as possible is ideal, since nobody wants to miss tonight's movie. Reality often proves otherwise: the faster you read, the more you tax the source system's IO and CPU, which can disrupt live operations, and your clients will never agree to that. Migration therefore requires restraint. Throttling measures become essential, such as limiting concurrent connections, controlling read batch sizes, and spacing out consecutive reads; a token-bucket sketch follows.
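A minimal token-bucket throttle over source reads; the rate unit (rows per second) and knob values are assumptions for illustration.

```python
import time

class Throttle:
    """Token bucket: reads block whenever they outrun the allowed rate."""
    def __init__(self, rows_per_sec):
        self.rate = rows_per_sec
        self.allowance = float(rows_per_sec)
        self.last = time.monotonic()

    def acquire(self, batch_size):
        now = time.monotonic()
        self.allowance = min(self.rate,
                             self.allowance + (now - self.last) * self.rate)
        self.last = now
        if self.allowance < batch_size:   # not enough budget: sleep it off
            time.sleep((batch_size - self.allowance) / self.rate)
            self.allowance = 0.0
        else:
            self.allowance -= batch_size

throttle = Throttle(rows_per_sec=10_000)
for _ in range(5):
    throttle.acquire(batch_size=2_000)    # never reads faster than the cap
```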

b) Ideally, it should be adaptive: unattended

As previously mentioned, the source stays busy with live operations, but few services run hot 24/7. Why not ramp up replication during off-peak periods? Liquid's solution monitors resource utilization at the source and autonomously adjusts the data-acquisition volume until it reaches predefined resource-consumption thresholds. An AIMD-style sketch of such a feedback rule follows.
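One plausible feedback rule, in the spirit of additive-increase/multiplicative-decrease: shrink the batch when source CPU exceeds the ceiling, grow it gently otherwise. This is an assumption about the mechanism, not Liquid's actual algorithm.

```python
def next_batch_size(current_batch, cpu_util, ceiling=0.7,
                    min_batch=100, max_batch=50_000):
    """Adapt the read batch size to keep source CPU under the ceiling."""
    if cpu_util > ceiling:
        return max(min_batch, current_batch // 2)        # back off quickly
    return min(max_batch, int(current_batch * 1.1) + 1)  # ramp up gently

batch = 1_000
for cpu in (0.4, 0.5, 0.9, 0.6):
    batch = next_batch_size(batch, cpu)
    print(batch)  # grows while CPU is low, halves after the 0.9 spike
```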

Consistency

1. The tension between performance and consistency

Consistency has always been a serious topic in the industry, because without it some programmers will try to bolt wings onto a sports car. For serious systems, we need a serious approach.

Consider consistency along several dimensions:

a) Intra-table consistency

Within a single table, we sometimes partition the data and write it in parallel, as in the single-table parallel operations mentioned earlier, or process INSERT, UPDATE, and DELETE operations in parallel. Such optimizations require strict scenario constraints and cannot be applied universally.

b) Inter-table consistency

Replication across multiple tables is performed in parallel. This is fine when the target serves no reads or has no multi-table consistency requirement; in other scenarios, the replication system must apply special processing.

c) Eventual consistency versus strict consistency

Without special processing, the partitioned writes and inter-table parallelism above can only achieve eventual consistency. For scenarios requiring strict consistency, Liquid adopts a "maximum parallel interval adjudication" mechanism: it identifies the largest record intervals that can be replayed in parallel without violating strict consistency, and parallelizes within them. This approach retains roughly 80% of eventual-consistency performance while preserving strict consistency. A toy version of the interval idea is sketched below.
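A toy illustration of cutting an ordered change stream into the longest runs whose keys never collide; operations inside a run can replay in parallel while the runs themselves stay ordered. This is an assumption about the general idea, not Liquid's actual algorithm.

```python
def parallel_intervals(changes):
    """Split an ordered stream of (key, op) into runs with no repeated key."""
    intervals, current, seen = [], [], set()
    for key, op in changes:
        if key in seen:            # key collision: close the current interval
            intervals.append(current)
            current, seen = [], set()
        current.append((key, op))
        seen.add(key)
    if current:
        intervals.append(current)
    return intervals

stream = [(1, "u1"), (2, "u2"), (1, "u3"), (3, "u4")]
print(parallel_intervals(stream))
# [[(1, 'u1'), (2, 'u2')], [(1, 'u3'), (3, 'u4')]]
```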

Fault diagnosis and data backfill

1. The nightmare begins when the migration order is issued

The key issues in data replication have been briefly outlined above. You may not hit all of them, but if you do not consider and check for them before typing the Start Replication command, you or your customers will face plenty of trouble:

a) At the start of replication: the source database cannot give you a user with sufficiently high privileges.

b) During replication: the source system is throttled, network jitter causes frequent errors, target-side performance is insufficient, the intermediate buffer layer is congested.

c) At the end of replication: the customer finds N records missing on the target, or N extra records on the target, and you are given 10 minutes to produce the list of those N records and backfill the data...

2. How to have a good dream

This is just the beginning; follow-up articles in this series will share more details. Below are a few tips from Xiao Liu, drawn from delivering and implementing data replication:

a) Develop a complete migration process and automate it, so that we do not get flustered in front of customers and bosses.

b) Implement checkpoint-based resumption properly and verify the consistency-guarantee logic.

c) Be prepared for fast and accurate data consistency checks.

d) Have the replication tool record every checkpoint; verbose logs are better than verbose customers.