redis

mirror of https://github.com/redis/redis.git synced 2026-01-14 14:58:02 -05:00

Author	SHA1	Message	Date
Yuan Wang	033abd6f57	Async IO threads (#13665 ) ## Introduction Redis introduced IO Thread in 6.0, allowing IO threads to handle client request reading, command parsing and reply writing, thereby improving performance. The current IO thread implementation has a few drawbacks. - The main thread is blocked during IO thread read/write operations and must wait for all IO threads to complete their current tasks before it can continue execution. In other words, the entire process is synchronous. This prevents the efficient utilization of multi-core CPUs for parallel processing. - When the number of clients and requests increases moderately, it causes all IO threads to reach full CPU utilization due to the busy wait mechanism used by the IO threads. This makes it challenging for us to determine which part of Redis has reached its bottleneck. - When IO threads are enabled with TLS and io-threads-do-reads, a disconnection of a connection with pending data may result in it being assigned to multiple IO threads simultaneously. This can cause race conditions and trigger assertion failures. Related issue: https://github.com/redis/redis/issues/12540 Therefore, we designed an asynchronous IO threads solution. The IO threads adopt an event-driven model, with the main thread dedicated to command processing, meanwhile, the IO threads handle client read and write operations in parallel. ## Implementation ### Overall As before, we did not change the fact that all client commands must be executed on the main thread, because Redis was originally designed to be single-threaded, and processing commands in a multi-threaded manner would inevitably introduce numerous race and synchronization issues. But now each IO thread has independent event loop, therefore, IO threads can use a multiplexing approach to handle client read and write operations, eliminating the CPU overhead caused by busy-waiting. the execution process can be briefly described as follows: the main thread assigns clients to IO threads after accepting connections, IO threads will notify the main thread when clients finish reading and parsing queries, then the main thread processes queries from IO threads and generates replies, IO threads handle writing reply to clients after receiving clients list from main thread, and then continue to handle client read and write events. ### Each IO thread has independent event loop We now assign each IO thread its own event loop. This approach eliminates the need for the main thread to perform the costly `epoll_wait` operation for handling connections (except for specific ones). Instead, the main thread processes requests from the IO threads and hands them back once completed, fully offloading read and write events to the IO threads. Additionally, all TLS operations, including handling pending data, have been moved entirely to the IO threads. This resolves the issue where io-threads-do-reads could not be used with TLS. ### Event-notified client queue To facilitate communication between the IO threads and the main thread, we designed an event-notified client queue. Each IO thread and the main thread have two such queues to store clients waiting to be processed. These queues are also integrated with the event loop to enable handling. We use pthread_mutex to ensure the safety of queue operations, as well as data visibility and ordering, and race conditions are minimized, as each IO thread and the main thread operate on independent queues, avoiding thread suspension due to lock contention. And we implemented an event notifier based on `eventfd` or `pipe` to support event-driven handling. ### Thread safety Since the main thread and IO threads can execute in parallel, we must handle data race issues carefully. client->flags The primary tasks of IO threads are reading and writing, i.e. `readQueryFromClient` and `writeToClient`. However, IO threads and the main thread may concurrently modify or access `client->flags`, leading to potential race conditions. To address this, we introduced an io-flags variable to record operations performed by IO threads, thereby avoiding race conditions on `client->flags`. Pause IO thread In the main thread, we may want to operate data of IO threads, maybe uninstall event handler, access or operate query/output buffer or resize event loop, we need a clean and safe context to do that. We pause IO thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To avoid thread suspended, we use busy waiting to confirm the target status. Besides we use atomic variable to make sure memory visibility and ordering. We introduce these functions to pause/resume IO Threads as below. ``` pauseIOThread, resumeIOThread pauseAllIOThreads, resumeAllIOThreads pauseIOThreadsRange, resumeIOThreadsRange ``` Testing has shown that `pauseIOThread` is highly efficient, allowing the main thread to execute nearly 200,000 operations per second during stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can handle up to nearly 56,000 operations per second. But operations performed between pausing and resuming IO threads must be quick; otherwise, they could cause the IO threads to reach full CPU utilization. freeClient and freeClientAsync The main thread may need to terminate a client currently running on an IO thread, for example, due to ACL rule changes, reaching the output buffer limit, or evicting a client. In such cases, we need to pause the IO thread to safely operate on the client. maxclients and maxmemory-clients updating When adjusting `maxclients`, we need to resize the event loop for all IO threads. Similarly, when modifying `maxmemory-clients`, we need to traverse all clients to calculate their memory usage. To ensure safe operations, we pause all IO threads during these adjustments. Client info reading The main thread may need to read a client’s fields to generate a descriptive string, such as for the `CLIENT LIST` command or logging purposes. In such cases, we need to pause the IO thread handling that client. If information for all clients needs to be displayed, all IO threads must be paused. Tracking redirect Redis supports the tracking feature and can even send invalidation messages to a connection with a specified ID. But the target client may be running on IO thread, directly manipulating the client’s output buffer is not thread-safe, and the IO thread may not be aware that the client requires a response. In such cases, we pause the IO thread handling the client, modify the output buffer, and install a write event handler to ensure proper handling. clientsCron In the `clientsCron` function, the main thread needs to traverse all clients to perform operations such as timeout checks, verifying whether they have reached the soft output buffer limit, resizing the output/query buffer, or updating memory usage. To safely operate on a client, the IO thread handling that client must be paused. If we were to pause the IO thread for each client individually, the efficiency would be very low. Conversely, pausing all IO threads simultaneously would be costly, especially when there are many IO threads, as clientsCron is invoked relatively frequently. To address this, we adopted a batched approach for pausing IO threads. At most, 8 IO threads are paused at a time. The operations mentioned above are only performed on clients running in the paused IO threads, significantly reducing overhead while maintaining safety. ### Observability In the current design, the main thread always assigns clients to the IO thread with the least clients. To clearly observe the number of clients handled by each IO thread, we added the new section in INFO output. The `INFO THREADS` section can show the client count for each IO thread. ``` # Threads io_thread_0:clients=0 io_thread_1:clients=2 io_thread_2:clients=2 ``` Additionally, in the `CLIENT LIST` output, we also added a field to indicate the thread to which each client is assigned. `id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name= lib-ver= io-thread=1` ## Trade-off ### Special Clients For certain special types of clients, keeping them running on IO threads would result in severe race issues that are difficult to resolve. Therefore, we chose not to offload these clients to the IO threads. For replica, monitor, subscribe, and tracking clients, main thread may directly write them a reply when conditions are met. Race issues are difficult to resolve, so we have them processed in the main thread. This includes the Lua debug clients as well, since we may operate connection directly. For blocking client, after the IO thread reads and parses a command and hands it over to the main thread, if the client is identified as a blocking type, it will be remained in the main thread. Once the blocking operation completes and the reply is generated, the client is transferred back to the IO thread to send the reply and wait for event triggers. ### Clients Eviction To support client eviction, it is necessary to update each client’s memory usage promptly during operations such as read, write, or command execution. However, when a client operates on an IO thread, it is not feasible to update the memory usage immediately due to the risk of data races. As a result, memory usage can only be updated either in the main thread while processing commands or in the `ClientsCron` periodically. The downside of this approach is that updates might experience a delay of up to one second, which could impact the precision of memory management for eviction. To avoid incorrectly evicting clients. We adopted a best-effort compensation solution, when we decide to eviction a client, we update its memory usage again before evicting, if the memory used by the client does not decrease or memory usage bucket is not changed, then we will evict it, otherwise, not evict it. However, we have not completely solved this problem. Due to the delay in memory usage updates, it may lead us to make incorrect decisions about the need to evict clients. ### Defragment In the majority of cases we do NOT use the data from argv directly in the db. 1. key names We store a copy that we allocate in the main thread, see `sdsdup()` in `dbAdd()`. 2. hash key and value We store key as hfield and store value as sds, see `hfieldNew()` and `sdsdup()` in `hashTypeSet()`. 3. other datatypes They don't even use SDS, so there is no reference issues. But in some cases client the data from argv may be retain by the main thread. As a result, during fragmentation cleanup, we need to move allocations from the IO thread’s arena to the main thread’s arena. We always allocate new memory in the main thread’s arena, but the memory released by IO threads may not yet have been reclaimed. This ultimately causes the fragmentation rate to be higher compared to creating and allocating entirely within a single thread. The following cases below will lead to memory allocated by the IO thread being kept by the main thread. 1. string related command: `append`, `getset`, `mset` and `set`. If `tryObjectEncoding()` does not change argv, we will keep it directly in the main thread, see the code in `tryObjectEncoding()`(specifically `trimStringObjectIfNeeded()`) 2. block related command. the key names will be kept in `c->db->blocking_keys`. 3. watch command the key names will be kept in `c->db->watched_keys`. 4. [s]subscribe command channel name will be kept in `serverPubSubChannels`. 5. script load command script will be kept in `server.lua_scripts`. 7. some module API: `RM_RetainString`, `RM_HoldString` Those issues will be handled in other PRs. ## Testing ### Functional Testing The commit with enabling IO Threads has passed all TCL tests, but we did some changes: Client query buffer: In the original code, when using a reusable query buffer, ownership of the query buffer would be released after the command was processed. However, with IO threads enabled, the client transitions from an IO thread to the main thread for processing. This causes the ownership release to occur earlier than the command execution. As a result, when IO threads are enabled, the client's information will never indicate that a shared query buffer is in use. Therefore, we skip the corresponding query buffer tests in this case. Defragment: Add a new defragmentation test to verify the effect of io threads on defragmentation. Command delay: For deferred clients in TCL tests, due to clients being assigned to different threads for execution, delays may occur. To address this, we introduced conditional waiting: the process proceeds to the next step only when the `client list` contains the corresponding commands. ### Sanitizer Testing The commit passed all TCL tests and reported no errors when compiled with the `fsanitizer=thread` and `fsanitizer=address` options enabled. But we made the following modifications: we suppressed the sanitizer warnings for clients with watched keys when updating `client->flags`, we think IO threads read `client->flags`, but never modify it or read the `CLIENT_DIRTY_CAS` bit, main thread just only modifies this bit, so there is no actual data race. ## Others ### IO thread number In the new multi-threaded design, the main thread is primarily focused on command processing to improve performance. Typically, the main thread does not handle regular client I/O operations but is responsible for clients such as replication and tracking clients. To avoid breaking changes, we still consider the main thread as the first IO thread. When the io-threads configuration is set to a low value (e.g., 2), performance does not show a significant improvement compared to a single-threaded setup for simple commands (such as SET or GET), as the main thread does not consume much CPU for these simple operations. This results in underutilized multi-core capacity. However, for more complex commands, having a low number of IO threads may still be beneficial. Therefore, it’s important to adjust the `io-threads` based on your own performance tests. Additionally, you can clearly monitor the CPU utilization of the main thread and IO threads using `top -H -p $redis_pid`. This allows you to easily identify where the bottleneck is. If the IO thread is the bottleneck, increasing the `io-threads` will improve performance. If the main thread is the bottleneck, the overall performance can only be scaled by increasing the number of shards or replicas. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: oranagra <oran@redislabs.com>	2024-12-22 19:30:37 +08:00
debing.sun	3fc7ef8f81	Fix race in stream-cgroups test (#13593 ) failed CI: https://github.com/redis/redis/actions/runs/11171608362/job/31056659165 https://github.com/redis/redis/actions/runs/11226025974/job/31205787575	2024-10-12 09:23:19 +08:00
Moti Cohen	5f28bd96db	Fix race in HFE tests (#13563 ) Test 1 - give more time for expiration Test 2 - Evaluate expiration time boundaries [+1,+2] before setting expiration [+1] Test 3 - Avoid race on test HFEs propagated to replica	2024-09-23 10:30:29 +03:00
Moti Cohen	9a89e32a95	HFE - Fix key ref by the hash on RENAME/MOVE/SWAPDB/RESTORE (#13539 ) If the hash previously had HFEs (hash-fields with expiration) but later no longer does, the key ref in the hash might become outdated after a MOVE, COPY, RENAME or RESTORE operation. These commands maintain the key ref only if HFEs are present. That is, we can only be sure that key ref is valid as long as the hash has HFEs.	2024-09-12 12:40:12 +03:00
Moti Cohen	569584d463	HFE - Simplify logic of HGETALL command (#13425 )	2024-09-05 12:48:44 +03:00
Zihao Lin	6ceadfb580	Improve GETRANGE command behavior (#12272 ) Fixed the issue about GETRANGE and SUBSTR command return unexpected result caused by the `start` and `end` out of definition range of string. --- ## break change Before this PR, when negative `end` was out of range (i.e., end < -strlen), we would fix it to 0 to get the substring, which also resulted in the first character still being returned for this kind of out of range. After this PR, we ensure that `GETRANGE` returns an empty bulk when the negative end index is out of range. Closes #11738 --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-08-20 12:34:43 +08:00
debing.sun	2b88db90aa	Fix incorrect lag due to trimming stream via XTRIM command (#13473 ) ## Describe When using the `XTRIM` command to trim a stream, it does not update the maximal tombstone (`max_deleted_entry_id`). This leads to an issue where the lag calculation incorrectly assumes that there are no tombstones after the consumer group's last_id, resulting in an inaccurate lag. The reason XTRIM doesn't need to update the maximal tombstone is that it always trims from the beginning of the stream. This means that it consistently changes the position of the first entry, leading to the following scenarios: 1) First entry trimmed after maximal tombstone: If the first entry is trimmed to a position after the maximal tombstone, all tombstones will be before the first entry, so they won't affect the consumer group's lag. 2) First entry trimmed before maximal tombstone: If the first entry is trimmed to a position before the maximal tombstone, the maximal tombstone will not be updated. ## Solution Therefore, this PR optimizes the lag calculation by ensuring that when both the consumer group's last_id and the maximal tombstone are behind the first entry, the consumer group's lag is always equal to the number of remaining elements in the stream. Supplement to PR https://github.com/redis/redis/pull/13338	2024-08-16 23:13:31 +08:00
debing.sun	b94b714f81	Fix error message for XREAD command with wrong parameter (#13474 ) Fixed a missing from #13117. When the number of streams is incorrect, the error message for `XREAD` needs to include the '+' symbol.	2024-08-14 21:40:43 +08:00
Moti Cohen	806459f481	On HDEL last field with expiry, update global HFE DS (#13470 ) Hash field expiration is optimized to avoid frequent update global HFE DS for each field deletion. Eventually active-expiration will run and update or remove the hash from global HFE DS gracefully. Nevertheless, statistic "subexpiry" might reflect wrong number of hashes with HFE to the user if HDEL deletes the last field with expiration in hash (yet there are more fields without expiration). Following this change, if HDEL the last field with expiration in the hash then take care to remove the hash from global HFE DS as well.	2024-08-11 16:39:03 +03:00
debing.sun	93fb83b4cb	Fix incorrect lag field in XINFO when tombstone is after the last_id of consume group (#13338 ) Fix #13337 Ths PR fixes fixed two bugs that caused lag calculation errors. 1. When the latest tombstone is before the first entry, the tombstone may stil be after the last id of consume group. 2. When a tombstone is after the last id of consume group, the group's counter will be invalid, we should caculate the entries_read by using estimates.	2024-07-30 22:31:31 +08:00
Moti Cohen	a84cc20aef	HFE - Fix statistic to count also lazy expired and rename INFO params (#13372 ) * INFO command : rename `hashes_with_expiry_fields` to `subexpiry` * INFO command : rename `expired_hash_fields` to `expired_subkeys` * Fix statistic of `expired_subkeys` to count also lazy expired * Remove TODOs comments leftover in TCL * Fix potential flaky test of rdb load of hash-field-expiration	2024-07-02 18:22:10 +03:00
Moti Cohen	a9267137ee	HFE - count in command must match actual number of fields (#13369 ) There was wrong preliminary assumption that we can optionally provide vector of arguments more than count. This is error-prone approach that leaded to actual error in that case. This PR enforce that vector of argument match count. Also fixed flaky HRANDFIELD test.	2024-06-26 14:12:06 +03:00
Moti Cohen	5eac99c312	Fix H(P)EXPIREAT command to propagate HDEL as well (#13364 ) H(P)EXPIREAT command might delete fields in case the absolute time is in the past. Those HDELs need to be propagated as well. In general, as we need to propagate H(P)EXPIRE(AT) command to the replica, each field that is mentioned in the command should be categorized into one of the four options: 1. Managed to update field’s expiration time - propagate it to replica as part of the HPEXPIREAT command. 2. Deleted the field because the time is in the past - propagate also HDEL command to delete the field and remove the field from the propagated HPEXPIREAT. 3. Condition not met for the field - Remove the field from the propagated HPEXPIREAT command. 4. Field does not exists - Remove the field from the propagated HPEXPIREAT command. If none of the provided fields match option number 1, then avoid also propagating the HPEXPIREAT command to the replica. This approach is aligned with the EXPIRE command: If a given key has already expired, then DEL will be propagated instead of EXPIRE command. If condition not met, then command will be rejected. Otherwise, EXPIRE command will be propagated for given key.	2024-06-25 13:15:09 +03:00
Moti Cohen	e26ea35cd4	Adapt HRANDFIELD to HFE feature (#13348 ) Considerations for the selected imp of HRANDFIELD & HFE feature: HRANDFIELD might access any of the fields in the hash as some of them might be expired. And so the Implementation of HRANDFIELD along with HFEs might be one of the two options: 1. Expire hash-fields before diving into handling HRANDFIELD. 2. Refine HRANDFIELD cases to deal with expired fields. Regarding the first option, as reference, the command RANDOMKEY also declareson O(1) complexity, yet might be stuck on a very long (but not infinite) loop trying to find non-expired keys. Furthermore RANDOMKEY also evicts expired keys along the way even though it is categorized as a read-only command. Note that the case of HRANDFIELD is more lightweight versus RANDOMKEY since HFEs have much more effective and aggressive active-expiration for fields behind. The second option introduces additional implementation complexity to HRANDFIELD. We could further refine HRANDFIELD cases to differentiate between scenarios with many expired fields versus few expired fields, and adjust based on the percentage of expired fields. However, this approach could still lead to long loops or necessitate expiring fields before selecting them. For the “lightweight” cases it is also expected to have a lightweight expiration. Considering the pros and cons, and the fact that HRANDFIELD is an infrequent command (particularly with HFEs) and the fact we have effective active-expiration behind for hash-fields, it is better to keep it simple and choose option number 1. Other changes: * Don't mark command dirty by internal hashTypeExpire(). It causes to read only command of HRANDFIELD to be accidently propagated (This flag should be indicated at higher level, by the command functions). * Align `hashTypeExpireIfNeeded()` and `hashTypeGetValue()` to be more aligned with `expireIfNeeded()` logic of keyspace.	2024-06-24 18:11:53 +03:00
Ozan Tezcan	4aa25d042c	Reply with array of return codes if the key does not exist for HFE commands (#13343 ) Currently, HFE commands reply with empty array if the key does not exist. Though, non-existing key and empty key is the same thing. It means fields given in the command do not exist in the empty key. So, replying with an array of 'no field' error codes (-2) suits better to Redis logic. e.g. Similarly, `hmget` returns array of nulls if the key does not exist. After this PR: ``` 127.0.0.1:6379> hpersist missingkey fields 2 a b 1) (integer) -2 2) (integer) -2 ```	2024-06-14 09:35:05 +03:00
Moti Cohen	f01fdc3960	Reserve 2 bits out of EB_EXPIRE_TIME_MAX for possible future use (#13331 ) Reserve 2 bits out of hash-field expiration time (`EB_EXPIRE_TIME_MAX`) for possible future lightweight indexing/categorizing of fields. It can be achieved by hacking HFE as follows: ``` HPEXPIREAT key [ 2^47 + USER_INDEX ] FIELDS numfields field [field …] ``` Redis will also need to expose kind of `HEXPIRESCAN` and `HEXPIRECOUNT` for this idea. Yet to be better defined. `HFE_MAX_ABS_TIME_MSEC` constraint must be enforced only at API level. Internally, the expiration time can be up to `EB_EXPIRE_TIME_MAX` for future readiness.	2024-06-10 16:57:26 +03:00
Moti Cohen	ce121b9261	HFE - Avoid lazy expire if called by modules + cleanup (#13326 ) Need to be carefull if called by modules since modules API allow to open and close key handler. We don't want to invalidate the handler underneath. * hashTypeExists(), hashTypeGetValueObject() - will return the logical state of the field. A flag will indicate noExpire. * RM_HashGet() - Will get NULL if the field expired. Fields won’t be deleted. * RM_ScanKey() - might return 0 items if all fields got expired. Fields won’t be deleted. * RM_HashSet() - If set, then override expired field. If delete, we can either delete or leave it to active-expiration. XX/NX - logically correct (Verify with tests). Nice to have (not implemented): * RedisModule_CloseKey() - We can local active-expire up-to 100 items. Note: Length will be wrong to modules just like redis (Count expired fields).	2024-06-10 11:24:26 +03:00
debing.sun	9a2c6ba4e7	Prevent negative expire parameter in HEXPIRE/HEXPIREAT/HPEXPIRE/HPEXPIREAT commands (#13310 ) 1. Don't allow HEXPIRE/HEXPIREAT/HPEXPIRE/HPEXPIREAT command expire parameters is negative 2. Remove a dead code reported from Coverity. when `unit` is not `UNIT_SECONDS`, the second `if (expire > (long long) EB_EXPIRE_TIME_MAX)` will be dead code. ```c # t_hash.c 2988 /* Check expire overflow / cond_at_most: Condition expire > 281474976710655LL, taking false branch. Now the value of expire is at most 281474976710655. 2989 if (expire > (long long) EB_EXPIRE_TIME_MAX) { 2990 addReplyErrorExpireTime(c); 2991 return; 2992 } 2994 if (unit == UNIT_SECONDS) { 2995 if (expire > (long long) EB_EXPIRE_TIME_MAX / 1000) { 2996 addReplyErrorExpireTime(c); 2997 return; 2998 } 2999 expire = 1000; 3000 } else { at_most: At condition expire > 281474976710655LL, the value of expire must be at most 281474976710655. dead_error_condition: The condition expire > 281474976710655LL cannot be true. 3001 if (expire > (long long) EB_EXPIRE_TIME_MAX) { CID 494223: (#1 of 1): Logically dead code (DEADCODE) dead_error_begin: Execution cannot reach this statement: addReplyErrorExpireTime(c);. 3002 addReplyErrorExpireTime(c); 3003 return; 3004 } 3005 } ``` --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>	2024-06-04 20:35:26 +08:00
jonghoonpark	5a3534f9b5	dynamically list test files (#13220 ) Related issue https://github.com/redis/redis/issues/13219 Motivation Currently we have to manually update the all_tests variable when introducing new test files. Modification I have modified it to list test files dynamically, but instead of modifying it to add all test files, I have modified it to only add only test files from the following 4 paths - unit - unit/type - unit/cluster - integration so that it doesn't deviate too much from what we already do Result - dynamically list test files to all_tests variable - close issue https://github.com/redis/redis/issues/13219 Additional information - removed `list-common.tcl` file and added `generate_largevalue_test_array` proc in `util.tcl`. because `list-common.tcl` is not a test file - There is an order dependency. So I added a code to the "Is a ziplist encoded Hash promoted on big payload?" test that resets hash-max-listpack-value to the default (64). --------- Signed-off-by: jonghoonpark <dev@jonghoonpark.com> Co-authored-by: debing.sun <debing.sun@redis.com>	2024-05-30 19:09:26 +08:00
debing.sun	7b9e960690	Hash Field Expiration (#13303 ) ## Background This PR introduces support for field-level expiration in Redis hashes. Previously, Redis supported expiration only at the key level, but this enhancement allows setting expiration times for individual fields within a hash. ## New commands * HEXPIRE * HEXPIREAT * HEXPIRETIME * HPERSIST * HPEXPIRE * HPEXPIREAT * HPEXPIRETIME * HPTTL * HTTL ## Short example from @moticless ```sh 127.0.0.1:6379> hset myhash f1 v1 f2 v2 f3 v3 (integer) 3 127.0.0.1:6379> hpexpire myhash 10000 NX fields 2 f2 f3 1) (integer) 1 2) (integer) 1 127.0.0.1:6379> hpttl myhash fields 3 f1 f2 f3 1) (integer) -1 2) (integer) 9997 3) (integer) 9997 127.0.0.1:6379> hgetall myhash 1) "f3" 2) "v3" 3) "f2" 4) "v2" 5) "f1" 6) "v1" ... after 10 seconds ... 127.0.0.1:6379> hgetall myhash 1) "f1" 2) "v1" 127.0.0.1:6379> ``` ## Expiration strategy 1. Integrate active Redis periodically performs active expiration and deletion of hash keys that contain expired fields, with a maximum attempt limit. 3. Lazy expiration When a client touches fields within a hash, Redis checks if the fields are expired. If a field is expired, it will be deleted. However, we do not delete expired fields during a traversal, we implicitly skip over them. ## RDB changes Add two new rdb type s`RDB_TYPE_HASH_METADATA` and `RDB_TYPE_HASH_LISTPACK_EX`. ## Notification 1. Add `hpersist` notification for `HPERSIST` command. 5. Add `hexpire` notification for `HEXPIRE`, `HEXPIREAT`, `HPEXPIRE` and `HPEXPIREAT` commands. ## Internal 1. Add new data structure `ebuckets`, which is used to store TTL and keys, enabling quick retrieval of keys based on TTL. 2. Add new data structure `mstr` like sds, which is used to store a string with TTL. This work was done by @moticless, @tezc, @ronen-kalish, @sundb, I just release it.	2024-05-30 15:26:19 +08:00
Moti Cohen	33fc0fbfae	HFE to support AOF and replicas (#13285 ) * For replica sake, rewrite commands `HEXPIRE` , `HSETF`, `HGETF` to have absolute unix time in msec. * On active-expiration of field, propagate HDEL to replica (`propagateHashFieldDeletion()`) * On lazy-expiration, propagate HDEL to replica (`hashTypeGetValue()` now calls `hashTypeDelete()`. It also takes care to call `propagateHashFieldDeletion()`). * Fix `HEXPIRE` command such that if it gets flag `LT` and it doesn’t have any expiration on the field then it will considered as valid condition. Note, replicas doesn’t make any active expiration, and should avoid lazy expiration. On `hashTypeGetValue()` it doesn't check expiration (As long as the master didn’t request to delete the field, it is valid) TODO: * Attach `dbid` to HASH metadata. See [here](https://github.com/redis/redis/pull/13209#discussion_r1593385850) --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-05-29 19:47:48 +08:00
Ozan Tezcan	2f34f6f0b9	Delete hsetf and hgetf (#13291 ) Changes: - Delete hsetf and hgetf commands - Hfe commands will return empty array instead of nil. --------- Co-authored-by: Moti Cohen <moticless@gmail.com>	2024-05-26 13:30:45 +03:00
Moti Cohen	60e1582ddb	Fix statistics test (#13293 )	2024-05-26 03:34:15 +03:00
Moti Cohen	f34f2ade85	Add Statistics hashes_with_expiry_fields to INFO (#13275 ) Added hashes_with_expiry_fields. Optimially it would better to have statistic of that counts all fields with expiry. But it requires careful logic and computation to follow and deep dive listpacks and hashes. This statistics is trivial to achieve and reflected by global HFE DS that has builtin enumeration of all the hashes that are registered in it.	2024-05-23 17:29:45 +03:00
debing.sun	95cbe879e5	sanitize dump payload for HFE (#13278 ) Add the following validations: 1. Get TTL using the lpGetIntegerValue() method instead of lpGetValue(), Ref https://github.com/redis/redis/pull/13209#discussion_r1602569422 2. The TTL of listpackex is a number in the valid range (0~EB_EXPIRE_TIME_MAX) and ordered. 3. The TTL fields of listpackex are ordered. 4. The TTL of hashtable is within the valid range (0~EB_EXPIRE_TIME_MAX). Other: Fix the missing of handling OBJ_ENCODING_LISTPACK_EX in dismissHashObject(). --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>	2024-05-22 10:53:30 +08:00
Moti Cohen	71676513dd	Fix commands HEXPIRE and H*TTL to include `FIELDS` constant (#13270 ) The same goes to: HPEXPIRE, HEXPIREAT, HPEXPIREAT, HEXPIRETIME, HPEXPIRETIME, HPTTL, HTTL, HPERSIST	2024-05-16 19:35:58 +03:00
debing.sun	ffbdf2f6f3	Fix test failure due to differing reply format of XREADGROUP under RESP3 in MULTI (#13255 ) This test was introducted by #13251. Normally we auto transform the reply format of XREADGROUP to array under RESP3 (see trasformer_funcs). But when we execute XREADGROUP command in multi it can't work, which cause the new test failed. The solution is to verity the reply of XREADGROUP in advance rather than in MULTI. Failed validate schema CI: https://github.com/redis/redis/actions/runs/9025128323/job/24800285684 --------- Co-authored-by: guybe7 <guy.benoish@redislabs.com>	2024-05-14 20:08:32 +08:00
Ozan Tezcan	5066e6e9cd	Fix hgetf/hsetf reply type by returning string (#13263 ) If encoding is listpack, hgetf and hsetf commands reply field value type as integer. This PR fixes it by returning string. Problematic cases: ``` 127.0.0.1:6379> hset hash one 1 (integer) 1 127.0.0.1:6379> hgetf hash fields 1 one 1) (integer) 1 127.0.0.1:6379> hsetf hash GETOLD fvs 1 one 2 1) (integer) 1 127.0.0.1:6379> hsetf hash DOF GETNEW fvs 1 one 2 1) (integer) 2 ``` Additional fixes: - hgetf/hsetf command description text Fixes #13261, #13262	2024-05-13 11:09:49 +03:00
Ozan Tezcan	ca4ed48db6	Add listpack support, hgetf and hsetf commands (#13209 ) Changes: - Adds listpack support to hash field expiration - Implements hgetf/hsetf commands Listpack support for hash field expiration We keep field name and value pairs in listpack for the hash type. With this PR, if one of hash field expiration command is called on the key for the first time, it converts listpack layout to triplets to hold field name, value and ttl per field. If a field does not have a TTL, we store zero as the ttl value. Zero is encoded as two bytes in the listpack. So, once we convert listpack to hold triplets, for the fields that don't have a TTL, it will be consuming those extra 2 bytes per item. Fields are ordered by ttl in the listpack to find the field with minimum expiry time efficiently. New command implementations as part of this PR: - HGETF command For each specified field get its value and optionally set the field's expiration time in sec/msec /unix-sec/unix-msec: ``` HGETF key [NX \| XX \| GT \| LT] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| PERSIST] <FIELDS count field [field ...]> ``` - HSETF command For each specified field value pair: set field to value and optionally set the field's expiration time in sec/msec /unix-sec/unix-msec: ``` HSETF key [DC] [DCF \| DOF] [NX \| XX \| GT \| LT] [GETNEW \| GETOLD] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| KEEPTTL] <FVS count field value [field value …]> ``` Todo: - Performance improvement. - rdb load/save - aof - defrag	2024-05-08 23:11:32 +03:00
guybe7	0e1de78fca	XREADGROUP from PEL should not affect server.dirty (#13251 ) Because it does not cause any propagation (arguably it should, see the comment in the tcl file) The motivation for this fix is that in 6.2 if dirty changed without propagation inside MULTI/EXEC it would cause propagation of EXEC only, which would result in the replica sending errors to its master	2024-05-06 16:55:42 +08:00
debing.sun	03cd525ffa	Fix reply schema for hfe related commands (#13238 )	2024-05-03 11:11:41 +08:00
Moti Cohen	c18ff05665	Hash Field Expiration - Basic support - Add ebuckets & mstr data structures - Integrate active & lazy expiration - Add most of the commands - Add support for dict (listpack is missing) TODOs: RDB, notification, listpack, HSET, HGETF, defrag, aof	2024-04-18 16:06:30 +03:00
Yanqi Lv	bad33f8738	fix wrong data type conversion in zrangeResultBeginStore (#13148 ) In `beginResultEmission`, -1 means the result length is not known in advance. But after #12185, if we pass -1 to `zrangeResultBeginStore`, it will convert to SIZE_MAX in `zsetTypeCreate` and try to `dictExpand`. Although `dictExpand` won't succeed because the size overflows, I think we'd better to avoid this wrong conversion. This bug can be triggered when the source of `zrangestore` doesn't exist or we use `zrangestore` command with `byscore` or `bylex`. The impact is that dst keys will be converted to use skiplist instead of listpack.	2024-03-19 08:52:55 +02:00
Ronen Kalish	a8e745117f	Xread last entry in stream (#7388 ) (#13117 ) Allow using `+` as a special ID for last item in stream on XREAD command. This would allow to iterate on a stream with XREAD starting with the last available message instead of the next one which `$` is used for. I.e. the caller can use `BLOCK` and `+` on the first call, and change to `$` on the next call. Closes #7388 --------- Co-authored-by: Felipe Machado <462154+felipou@users.noreply.github.com>	2024-03-13 08:23:32 +02:00
Binbin	f17381a38d	Fix propagation of entries_read by calling streamPropagateGroupID unconditionally (#12898 ) In XREADGROUP ACK, because streamPropagateXCLAIM does not propagate entries-read, entries-read will be inconsistent between master and replicas. I.e. if no entries were claimed, it would have propagated correctly, but if some were claimed, then the entries-read field would be inconsistent on the replica. The fix was suggested by guybe7, call streamPropagateGroupID unconditionally, so that we will normalize entries_read on the replicas. In the past, we would only set propagate_last_id when NOACK was specified. And in #9127, XCLAIM did not propagate entries_read in ACK, which would cause entries_read to be inconsistent between master and replicas. Another approach is add another arg to XCLAIM and let it propagate entries_read, but we decided not to use it. Because we want minimal damage in case there's an old target and new source (in the worst case scenario, the new source doesn't recognize XGROUP SETID ... ENTRIES READ and the lag is lost. If we change XCLAIM, the damage is much more severe). In this patch, now if the user uses XREADGROUP .. COUNT 1 there will be an additional overhead of MULTI, EXEC and XGROUPSETID. We assume the extra command in case of COUNT 1 (4x factor, changing from one XCLAIM to MULTI+XCLAIM+XSETID+EXEC), is probably ok since reading just one entry is in any case very inefficient (a client round trip per record), so we're hoping it's not a common case. Issue was introduced in #9127.	2024-02-29 09:48:20 +02:00
debing.sun	165afc5f2a	Determine the large limit of the quicklist node based on fill (#12659 ) Following #12568 In issue #9357, when inserting an element larger than 1GB, we currently store it in a plain node instead of a listpack. Presently, when we insert an element that exceeds the maximum size of a packed node, it cannot be accommodated in any other nodes, thus ending up isolated like a large element. I.e. it's a node with only one element, but it's listpack encoded rather than a plain buffer. This PR lowers the threshold for considering an element as 'large' from 1GB to the maximum size of a node. While this change doesn't completely resolve the bug mentioned in the previous PR, it does mitigate its potential impact. As a result of this change, we can now only use LSET to replace an element with another element that falls below the maximum size threshold. In the worst-case scenario, with a fill of -5, the largest packed node we can create is 2GB (32k * 64k): * 32k: The smallest element in a listpack is 2 bytes, which allows us to store up to 32k elements. * 64k: This is the maximum size for a single quicklist node. ## Others To fully fix #9357, we need more work, as discussed in #12568, when we insert an element into a quicklistNode, it may be created in a new node, put into another node, or merged, and we can't correctly delete the node that was supposed to be deleted. I'm not sure it's worth it, since it involves a lot of modifications.	2024-02-22 10:02:38 +02:00
Binbin	32f44da510	Increase tolerance range to block reprocess tests to avoid timing issues (#13053 ) These tests have all failed in daily CI: ``` * [err]: Blocking XREADGROUP for stream key that has clients blocked on stream - reprocessing command in tests/unit/type/stream-cgroups.tcl Expected '1101' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test) * [err]: BLPOP unblock but the key is expired and then block again - reprocessing command in tests/unit/type/list.tcl Expected '1101' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test) *** [err]: BZPOPMIN unblock but the key is expired and then block again - reprocessing command in tests/unit/type/zset.tcl Expected '1103' to be between to '1000' and '1100' (context: type eval line 23 cmd {assert_range [expr $end-$start] 1000 1100} proc ::test) ``` Increase the range to avoid failures, and improve the comment to be clearer. tests was introduced in #13004.	2024-02-15 10:44:49 +02:00
debing.sun	1e8dc1da0d	Fix crash due to merge of quicklist node introduced by #12955 (#13040 ) Fix two crash introducted by #12955 When a quicklist node can't be inserted and split, we eventually merge the current node with its neighboring nodes after inserting, and compress the current node and its siblings. 1. When the current node is merged with another node, the current node may become invalid and can no longer be used. Solution: let `_quicklistMergeNodes()` return the merged nodes. 3. If the current node is a LZF quicklist node, its recompress will be 1. If the split node can be merged with a sibling node to become head or tail, recompress may cause the head and tail to be compressed, which is not allowed. Solution: always recompress to 0 after merging.	2024-02-08 14:29:16 +02:00
debing.sun	1f00c951c2	Prevent LSET command from causing quicklist plain node size to exceed 4GB (#12955 ) Fix #12864 The main reason for this crash is that when replacing a element of a quicklist packed node with lpReplace() method, if the final size is larger than 4GB, lpReplace() will fail and returns NULL, causing `node->entry` to be incorrectly set to NULL. Since the inserted data is not a large element, we can't just replace it like a large element, first quicklistInsertAfter() and then quicklistDelIndex(), because the current node may be merged and invalidated in quicklistInsertAfter(). The solution of this PR: When replacing a node fails (listpack exceeds 4GB), split the current node, create a new node to put in the middle, and try to merge them. This is the same as inserting a large element. In the worst case, its size will not exceed 4GB.	2024-02-06 18:21:28 +02:00
Binbin	492021db95	Fix blocking commands timeout is reset due to re-processing command (#13004 ) In #11012, we will reprocess command when client is unblocked on keys, in some blocking commands, for example, in the XREADGROUP BLOCK scenario, because of the re-processing command, we will recalculate the block timeout, causing the blocking time to be reset. This commit add a new CLIENT_REPROCESSING_COMMAND clent flag, explicitly let the command know that it is being re-processed, later in blockForKeys we will not reset the timeout. Affected BLOCK cases: - list / zset / stream, added test cases for each. Unaffected cases: - module (never re-process the commands). - WAIT / WAITAOF (never re-process the commands). Fixes #12998.	2024-01-30 11:32:59 +02:00
Roshan Khatri	5358bd7cdd	Reduce performance impact of dict rehashing and make it shorter. (#12899 ) #### Problem Statement: For any read/update operation during rehashing, we're doing ~10+ random DRAM lookups to do the rehashing, as we are using the `rehashidx` to rehash 10 buckets, whose dict entries most likely aren't cached in the CPU or near the bucket we are operating on. If these random bucket are empty, the rehashing process during that command execution is skipped. #### Implementation: For reducing the performance recession while dict is rehashing, we determine the index at which the key would be stored in the 0th HT, we check if that index has already been rehashed, if not we will rehash the bucket containing the key and the bucket will be moved from 0th HT to the 1st HT. If the key has already been rehashed, we perform the random access bucket rehash (using `rehashidx`) and we again verify if rehashing is still ongoing and look up the key in the respective HT. This ensures rehashing is not skipped in any command call and that we rehash a particular bucket or random bucket in each call. #### Changes in this PR: - Added a new method `dictBucketRehash` to perform rehash on a single bucket. - Helper function `moveKeysInBucketOldtoNew` for `dictRehash` and `dictBucketRehash` to move all the keys in a bucket from the old to the new hash HT. - Helper function `verifyMoreRehashRequired` for `dictRehash` and `dictBucketRehash` to check if we have already rehashed the whole table and if more rehashing is required. ### Benchmark: - This PR still shows ~13% improvement in the latency during rehashing. - Rehashing is now ~2% faster for this PR when compared to unstable. --------- Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>	2024-01-27 11:11:53 +02:00
Wen Hui	685409139b	Add INCR type command against wrong argument test cases. (#12836 ) We have test cases for incr related commands with no key exist and spaces in key and wrong type of key. However, we dont have test cases covered for INCRBY INCRBYFLOAT DECRBY INCR DECR HINCRBY HINCRBYFLOAT ZINCRBY with valid key and invalid value as argument, and float value to incrby and decrby. So added test cases for the scenarios in incr.tcl. Thank you!	2024-01-23 15:39:38 +02:00
Yanqi Lv	e2b7932b34	Shrink dict when deleting dictEntry (#12850 ) When we insert entries into dict, it may autonomously expand if needed. However, when we delete entries from dict, it doesn't shrink to the proper size. If there are few entries in a very large dict, it may cause huge waste of memory and inefficiency when iterating. The main keyspace dicts (keys and expires), are shrinked by cron (`tryResizeHashTables` calls `htNeedsResize` and `dictResize`), And some data structures such as zset and hash also do that (call `htNeedsResize`) right after a loop of calls to `dictDelete`, But many other dicts are completely missing that call (they can only expand). In this PR, we provide the ability to automatically shrink the dict when deleting. The conditions triggering the shrinking is the same as `htNeedsResize` used to have. i.e. we expand when we're over 100% utilization, and shrink when we're below 10% utilization. Additionally: * Add `dictPauseAutoResize` so that flows that do mass deletions, will only trigger shrinkage at the end. * Rename `dictResize` to `dictShrinkToFit` (same logic as it used to have, but better name describing it) * Rename `_dictExpand` to `_dictResize` (same logic as it used to have, but better name describing it) related to discussion https://github.com/redis/redis/pull/12819#discussion_r1409293878 --------- Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>	2024-01-15 08:20:53 +02:00
Wen Hui	5dc631d880	Add missing test cases for hash commands (#12851 ) We dont have test for hgetall against key doesnot exist so added the test in test suite and along with this, added wrong type cases for other missing commands.	2023-12-17 14:02:53 +02:00
Binbin	96e9dec419	Bump codespell from 2.2.4 to 2.2.5 (#12557 ) and adjustments.	2023-09-08 16:10:17 +03:00
Chen Tianjie	b26e8e3213	Optimize ZRANGE offset location from linear search to skiplist jump. (#12450 ) ZRANGE BYSCORE/BYLEX with [LIMIT offset count] option was using every level in skiplist to jump to the first/last node in range, but only use level[0] in skiplist to locate the node at offset, resulting in sub-optimal performance using LIMIT: ``` while (ln && offset--) { if (reverse) { ln = ln->backward; } else { ln = ln->level[0].forward; } } ``` It could be slow when offset is very big. We can get the total rank of the offset location and use skiplist to jump to it. It is an improvement from O(offset) to O(log rank). Below shows how this is implemented (if the offset is positve): Use the skiplist to seach for the first element in the range, record its rank `rank_0`, so we can have the rank of the target node `rank_t`. Meanwhile we record the last node we visited which has zsl->level-1 levels and its rank `rank_1`. Then we start from the zsl->level-1 node, use skiplist to go forward `rank_t-rank_1` nodes to reach the target node. It is very similiar when the offset is reversed. Note that if `rank_t` is very close to `rank_0`, we just start from the first element in range and go node by node, this for the case when zsl->level-1 node is to far away and it is quicker to reach the target node by node. Here is a test using a random generated zset including 10000 elements (with different positive scores), doing a bench mark which compares how fast the `ZRANGE` command is exucuted before and after the optimization. The start score is set to 0 and the count is set to 1 to make sure that most of the time is spent on locating the offset. ``` memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test 0 +inf byscore limit <offset> 1" ``` \| offset \| QPS(unstable) \| QPS(optimized) \| \|--------\|--------\|--------\| \| 10 \| 73386.02 \| 74819.82 \| \| 1000 \| 48084.96 \| 73177.73 \| \| 2000 \| 31156.79 \| 72805.83 \| \| 5000 \| 10954.83 \| 71218.21 \| With the result above, we can see that the original code is greatly slowed down when offset gets bigger, and with the optimization the speed is almost not affected. Similiar results are generated when testing reversed offset: ``` memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test +inf 0 byscore rev limit <offset> 1" ``` \| offset \| QPS(unstable) \| QPS(optimized) \| \|--------\|--------\|--------\| \| 10 \| 74505.14 \| 71653.67 \| \| 1000 \| 46829.25 \| 72842.75 \| \| 2000 \| 28985.48 \| 73669.01 \| \| 5000 \| 11066.22 \| 73963.45 \| And the same conclusion is drawn from the tests of ZRANGE BYLEX.	2023-08-31 14:42:08 +03:00
Harkrishn Patro	42985b00ea	Test coverage for incr/decr operation on robj encoding type optimization (#12435 ) Additional test coverage for incr/decr operation. integer number could be present in raw encoding format due to operation like append. A incr/decr operation following it optimize the string to int encoding format.	2023-07-25 16:43:31 -07:00
Harkrishn Patro	34b95f752c	Add test case for APPEND command usage on integer value (#12429 ) Add test coverage to validate object encoding update on APPEND command usage on a integer value	2023-07-24 18:25:50 -07:00
Binbin	6b57241fa8	Revert zrangeGenericCommand negative offset check (#12377 ) The negative offset check was added in #9052, we realized that this is a non-mandatory breaking change and we would like to add it only in 8.0. This reverts PR #9052, will be re-introduced later in 8.0.	2023-07-02 20:38:30 +03:00
Binbin	d9c2ef8a72	zrangeGenericCommand add check for negative offset (#9052 ) Now we will check the offset in zrangeGenericCommand. With a negative offset, we will throw an error and return. This also resolve the issue of zeroing the destination key in case of the "store" variant when we input a negative offset. ``` 127.0.0.1:6379> set key value OK 127.0.0.1:6379> zrangestore key myzset 0 10 byscore limit -1 10 (integer) 0 127.0.0.1:6379> exists key (integer) 0 ``` This change affects the following commands: - ZRANGE / ZRANGESTORE / ZRANGEBYLEX / ZRANGEBYSCORE - ZREVRANGE / ZREVRANGEBYSCORE / ZREVRANGEBYLEX	2023-06-20 14:33:17 +03:00

1 2 3 4 5 ...

401 Commits