kvspool data stream tools
=========================
Troy D. Hanson <tdh@tkhanson.net>

Back to the https://github.com/troydhanson/kvspool[kvspool Github page]. Back to
http://troydhanson.github.io[my other software].

kv-spool ("key-value" spool)::
  a Linux-based C library to stream data between programs as key-value dictionaries.
  It has tools to pipe the streams around, such as by network pub/sub or tee, to
  concentrate streams together, and to monitor the consumption status of streams.

kvspool's niche
---------------
Kvspool is a tiny API to stream dictionaries between programs. Each dictionary is a set
of key-value pairs whose keys and values are text.

To use kvspool, two programs open the same spool- which is just a directory. The writer
puts dictionaries into the spool. The reader gets dictionaries from the spool. It blocks
when it's caught up, waiting for more data. Like this:

image:reader-writer.png[A spool writer and reader]

.Why did I write kvspool?
*******************************************************************************
I wanted a very simple library that only writes to the local file system, so
applications can use kvspool without having to set anything up ahead of time-
no servers to run, no configuration files. I wanted no "side effects" to happen
in my programs-- no thread creation, no sockets, nothing going on underneath.
I wanted fewer rather than more features- taking the Unix pipe as a role model.
*******************************************************************************

.Loose coupling
Because the spooled data goes to disk, the reader and writer are decoupled. They
don't have to run at the same time. They can come and go. Also, if the reader exits and
restarts, it picks up where it left off.

Space management
~~~~~~~~~~~~~~~~
In a streaming data environment, the writer and reader can run for months on end. A key
requirement is that we don't fill up the disk. So, when we make a spool directory, we
tell kvspool what its maximum size should be:

 % mkdir spool
 % kvsp-init -s 1G spool

This configures a maximum of 1 GB of data in the spool. After it reaches that size, it
stays that size, by deleting old data as new data comes into the spool. Deletion is
done in units of roughly 10% of the total spool size- so a 1 GB spool sheds data in
chunks of about 100 MB.

A reader that's offline long enough can eventually lose data- that is, miss the
opportunity to read it before it's deleted. Data loss is a deliberate feature - otherwise
it'd be necessary to block the writer or use unbounded disk space.

Shared memory I/O
~~~~~~~~~~~~~~~~~
You can locate a spool on a RAM disk if you want the speed of shared memory without true
disk persistence- kvspool comes with a `ramdisk` utility to make one.

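As a rough sketch (the `ramdisk` options are described under <<other_utilities,Other
utilities>> below; the sizes and paths here are only examples), a spool can be placed on
a ramdisk like this:

 % ramdisk -c -s 1G /mnt/ramdisk
 % mkdir /mnt/ramdisk/spool
 % kvsp-init -s 500M /mnt/ramdisk/spool
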
Rewind and replay
~~~~~~~~~~~~~~~~~
After data has been read from the spool, it remains in the spool directory until a future
time when kvspool deletes it to make room for new data. This is the basis for a handy
"rewind and replay" feature:

 % kvsp-rewind spool

After a rewind, reading starts at the beginning of the spool. The reader should be
restarted for this to take effect. Rewind and replay is a convenient way to develop a
program on a consistent source of test data.

Snapshot
~~~~~~~~
A snapshot is just a copy of the spool. You can bring the copy back to a development
environment, rewind it, and use it as a consistent source of test data.

 % cp -r spool snapshot

Copying a spool of production data lets you develop programs on real data but without
needing a writer present- the data has been "canned".

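For instance, to replay the canned data from the beginning, rewind the copy rather than
the original spool (terminate any readers of the copy first, as noted for `kvsp-rewind`
below):

 % kvsp-rewind snapshot
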
Local and Network Replication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can also publish a spool over the network, like this:

 % kvsp-pub -d spool tcp://192.168.1.9:1110

Now, on the remote computers where you wish to subscribe to the spool, run:

 % kvsp-sub -d spool tcp://192.168.1.9:1110

If you give multiple addresses to `kvsp-sub`, it connects to all of them and concentrates
their published output into a single spool.

 % kvsp-sub -d spool tcp://192.168.1.9:1110 tcp://192.168.1.10:1111

Kvspool includes a few kinds of replication utilities described <<net_utilities,below>>.

.The big picture
*******************************************************************************
Before moving on- let's take a deep breath and recap. With kvspool, the writer
is unaware of whether network replication is taking place. The writer just writes
to the local spool. We run the `kvsp-pub` utility in the background; as data
comes into the spool, `kvsp-pub` transmits it over the network.

On the other computer- the receiving side- we run `kvsp-sub` in the background.
It receives the network transmissions and writes them to its local spool.

Use `kvsp-pub` and `kvsp-sub` to maintain a live, continuous replication. As
data is written to the source spool, it just "shows up" in the remote spool.
The reader and writer are completely uninvolved in the replication process.
*******************************************************************************

image:pub-sub.png[Publish and Subscribe]

You can run `kvsp-sub` and `kvsp-pub` in the background at system boot by setting
them up as services or http://troydhanson.github.com/pmtr/[pmtr] jobs.

License
~~~~~~~
Kvspool uses the MIT license.

Getting kvspool
---------------
You can clone kvspool from github:

 % git clone git://github.com/troydhanson/kvspool.git

To build and install kvspool, you need autotools installed. The configure script looks
to see if optional libraries including ZeroMQ, Nanomsg, Jansson and librdkafka are
installed. It builds additional utilities if they are present.

 % cd kvspool
 % ./autogen.sh
 % ./configure
 % make
 % sudo make install

Utilities
---------

Basic
~~~~~

[width="90%",cols="20m,50m",grid="none",options="header"]
|===============================================================================
|command | example
|kvsp-init | kvsp-init -s 1G spool
|kvsp-status | kvsp-status spool
|kvsp-rewind | kvsp-rewind spool
|kvsp-tee | kvsp-tee -s spool copy1 copy2
|kvsp-concen | kvsp-concen -d spool1 -d spool2 spool
|kvsp-bcat | kvsp-bcat -b config -d spool
|===============================================================================

The `kvsp-init` command is used when a spool directory is first created, to set
the maximum capacity of the spool. It accepts k/m/g/t suffixes. If `kvsp-init` is
run again later, after the spool already exists and has data, the spool is resized.

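For example (the sizes here are arbitrary), an existing 1 GB spool could be grown to 2 GB:

 % kvsp-init -s 2G spool
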
Run `kvsp-status` to see what percentage of the spool has been consumed by a reader. It
can take multiple spools as arguments. For each spool it prints a line like:

 /tmp/spool 99% 877kb 28secs

The fields show what percentage of the spool has been read, how much data is currently in
the spool, and how recently data was written into the spool.

The `kvsp-rewind` command resets the reader position to the beginning (oldest frame) of the
spool. Use this command in order to "replay" the spooled data. Disconnect (terminate) any
readers before running this command.

Use `kvsp-tee` to support multiple readers from one input spool. First make a separate
spool directory for each reader (and use `kvsp-init` to set the capacity of each one);
then use `kvsp-tee` as the reader on the source spool. It maintains a continuous copy of
the spool to the multiple destination spools. This command needs to be left running to
maintain the tee. The `-W` flag can be used to run the tee in the more efficient raw mode,
which copies the binary input directly to the binary output representation in the spool
file. The incoming data can be filtered using the `-k key -r regex` options. In this
usage, the tee passes a dictionary only if it contains the key and its value matches the
regex.

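A sketch of that filtering usage (the key name, regex, destination spool name, and flag
ordering here are assumptions, not canonical):

 % kvsp-tee -k sensor_name -r '^temp' -s spool filtered
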
The `kvsp-concen` utility is the opposite of `kvsp-tee`. It takes multiple source
spools and makes a single output spool from them. It is a spool concentrator. The
source spools are flagged with `-d spool` and the final argument is the output spool.

The `kvsp-bcat` command operates like `kvsp-bpub` (see below). It writes the binary-
encoded spool content to standard output.

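For example (the output filename is arbitrary; `cast.cfg` is the type file described
under <<net_utilities,Network utilities>>), the binary stream can be captured to a file:

 % kvsp-bcat -b cast.cfg -d spool > frames.bin
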
[[net_utilities]]
Network utilities
~~~~~~~~~~~~~~~~~

[width="90%",cols="10m,50m",grid="none",options="header"]
|===============================================================================
|command | example
|kvsp-pub | kvsp-pub -d spool tcp://192.168.1.9:1110
|kvsp-sub | kvsp-sub -d spool tcp://192.168.1.9:1110
|kvsp-bpub | kvsp-bpub -b cast.cfg -d spool tcp://192.168.1.9:2110
|kvsp-bsub | kvsp-bsub -b cast.cfg -d spool tcp://192.168.1.9:2110
|kvsp-kkpub | kvsp-kkpub -b kafka.host.name -t topic
|kvsp-tpub | kvsp-tpub -b cast.cfg -d spool -p 2110
|===============================================================================

The network utilities keep a local spool continuously replicated to a remote spool.

kvsp-pub/kvsp-sub
^^^^^^^^^^^^^^^^^
The utilities `kvsp-pub` and `kvsp-sub` exist to publish a source spool to a remote spool.
This pair of utilities communicates in JSON over ZeroMQ. The publisher listens on the
specified TCP port. The subscriber connects to it. If more than one subscriber connects,
each receives a copy of the data. When no subscribers are connected, the publisher drops
data. However, when the `-s` flag is given to both `kvsp-pub` and `kvsp-sub`, two things
change: the publisher queues data while waiting for a subscriber instead of dropping it;
and if more than one subscriber connects, the data gets divided among them rather than
duplicated to all of them.

A subscriber can concentrate data (that is, "fan-in" the data) from many publishers,
simply by listing multiple ZeroMQ endpoints on the command line.

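A sketch of the queue-and-divide mode (flag placement is an assumption; the addresses and
spool names reuse the examples above):

 % kvsp-pub -s -d spool tcp://192.168.1.9:1110   # on the publishing host
 % kvsp-sub -s -d spool tcp://192.168.1.9:1110   # on worker 1
 % kvsp-sub -s -d spool tcp://192.168.1.9:1110   # on worker 2

With `-s` on both sides, each dictionary goes to only one of the two subscribers.
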
kvsp-bpub/kvsp-bsub
^^^^^^^^^^^^^^^^^^^
The `kvsp-bpub` and `kvsp-bsub` utilities implement binary-over-ZeroMQ replication. Their
usage is the same as `kvsp-pub` and `kvsp-sub` except for the required `-b cast.cfg`
argument. This file lists the binary data type for each required key. Here's an example
`cast.cfg`:

 i32 timestamp
 str sensor_name
 i8 temp

In this example, every dictionary in the source spool is expected to have three keys
(timestamp, sensor_name and temp); any other keys get ignored. Their values are parsed
into the binary types listed (i32, str and i8). The binary buffer is transmitted
without any keys. On the receiving subscriber, the binary is unpacked using the same file.
The reason for using binary publishing is speed- it's often an order of magnitude faster.
These data types may appear in `cast.cfg`:

 i8   // byte (8-bit int)
 i16  // short (16-bit int)
 i32  // integer (32-bit int)
 ipv4 // dotted quad IP address
 str  // string
 d64  // double (64-bit float)

kvsp-kkpub
^^^^^^^^^^
Publishes the spool in JSON encoding to a Kafka topic. This tool requires librdkafka and
Jansson to be installed.

kvsp-tpub
^^^^^^^^^
Finally, there is a "plain TCP" binary publisher. It has no subscriber counterpart yet, so
you have to code your own subscriber to use it. It takes a cast.cfg of the same form as
above. It listens on the specified port, and when a subscriber connects to it, the
dictionaries in the spool are transmitted as length-prefixed binary messages. The length
prefix is a 32-bit integer (host-endianness) specifying the length of the message that
follows. The remaining binary data is transmitted in host-endianness, except IP addresses,
which are sent in network order.

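Since the subscriber is left to you, here is a minimal C sketch of one. It only reads the
length-prefixed frames and reports their sizes; decoding the fields per cast.cfg is left
to the application. It assumes the subscriber runs on a machine with the same endianness
as the publisher, and the address and port are just the examples used above.

[source,c]
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
 #include <unistd.h>
 #include <arpa/inet.h>
 #include <netinet/in.h>
 #include <sys/socket.h>
 /* read exactly n bytes, returning -1 on EOF or error */
 static int readn(int fd, void *buf, size_t n) {
   char *p = buf;
   while (n > 0) {
     ssize_t rc = read(fd, p, n);
     if (rc <= 0) return -1;
     p += rc; n -= rc;
   }
   return 0;
 }
 int main() {
   struct sockaddr_in sin;
   memset(&sin, 0, sizeof(sin));
   sin.sin_family = AF_INET;
   sin.sin_port = htons(2110);                        /* port given to kvsp-tpub -p */
   inet_pton(AF_INET, "192.168.1.9", &sin.sin_addr);  /* publisher address (example) */
   int fd = socket(AF_INET, SOCK_STREAM, 0);
   if (fd < 0 || connect(fd, (struct sockaddr*)&sin, sizeof(sin)) < 0) {
     perror("connect"); return 1;
   }
   uint32_t len;
   while (readn(fd, &len, sizeof(len)) == 0) {        /* host-endian length prefix */
     char *msg = malloc(len);
     if (msg == NULL || readn(fd, msg, len) < 0) break;
     printf("received %u-byte frame\n", (unsigned)len); /* unpack fields per cast.cfg here */
     free(msg);
   }
   close(fd);
   return 0;
 }
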
[[other_utilities]]
Other utilities
~~~~~~~~~~~~~~~

[width="90%",cols="10m,50m",grid="none",options="header"]
|===============================================================================
|command | example
|kvsp-spr | kvsp-spr -B 0 spool
|kvsp-spw | kvsp-spw -i 10 spool
|kvsp-mod | kvsp-mod -k key -o spool2 spool
|kvsp-speed | kvsp-speed
|ramdisk | ramdisk -c -s 1G /mnt/ramdisk
|===============================================================================

The `kvsp-spr` utility is used to manually read a spool and print its frames to the
screen. Normally it blocks waiting for data once it reaches the end of the spool, but
the `-B 0` (no-block) option tells it to stop reading when the end of the spool is reached.

The `kvsp-spw` utility is used only for testing; it writes a frame of data to the spool
(or several frames if the `-i` option is used with a count); in the latter mode there is a
sleep (delay) between frames, which can be adjusted using the `-d <seconds>` option.

The `kvsp-mod` command "obfuscates" selected values from a source spool, turning them into
hash values in the output spool (named with the `-o` option). For each key named with the
`-k` flag, its value in the output spool is replaced with a mathematical hash. The hashing
is consistent (the same input value always produces the same output value), but the hashed
value itself is a meaningless number.

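For example (the key name and output spool name are placeholders), to produce an output
spool in which the `user` values are hashed:

 % kvsp-mod -k user -o spool-anon spool
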
The `kvsp-speed` utility runs a simple benchmark to measure read and write performance.

The `ramdisk` utility creates, queries or unmounts a ramdisk- a Linux tmpfs filesystem.
In the form shown in the table above it creates a 1G ramdisk on the `/mnt/ramdisk` mount
point (this directory must already exist). A ramdisk created this way will appear in the
`/proc/mounts` listing. If a ramdisk already exists on that mount point, the command does
nothing. In create (`-c`) mode, the `-d <dir>` option may be used one or more times to
specify (as absolute paths) directories to create within the ramdisk. Using `ramdisk -u
/mnt/ramdisk` unmounts it. The `-q` option queries a directory to see whether it's a
ramdisk and shows its size. The `ramdisk` utility is included with kvspool because it is
often convenient to locate a spool on a ramdisk for performance.

API
---

C programs must be linked with `-lkvspool`.

[source,c]
 #include "kvspool.h"
 ...
 void *sp = kv_spoolwriter_new("spool");
 void *set = kv_set_new();
 kv_adds(set, "day", "Wednesday");
 kv_adds(set, "user", "Troy");
 kv_spool_write(sp,set);
 ...
 kv_set_free(set);
 kv_spoolwriter_free(sp);

In C, a spool is opened for writing by calling `kv_spoolwriter_new`, which takes the spool
directory as its argument. It returns an opaque handle (which you should eventually free
with `kv_spoolwriter_free`).

Kvspool provides a data structure that implements a dictionary in C. It is created using
`kv_set_new`, which returns an opaque handle (which should eventually be freed using
`kv_set_free`). In the meantime, it can be used for the whole lifetime of the program, by
adding key-value pairs to it, writing them out, clearing it, and reusing it over and over.
To add a key-value pair, use `kv_adds`, which takes the set handle, the key and the value.
The key and value get 'copied' into the set (the set does not keep a pointer to them).
To write the set into the spool, call `kv_spool_write` with the spool and set handles.
To re-use the set, call `kv_set_clear` with the set handle. For debugging you can dump a
set to stderr using `kv_set_dump(set,stderr)`.

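Tying those calls together, here is a sketch of a writer that reuses one set for many
frames. It uses only the functions described above; the keys, values and loop count are
illustrative.

[source,c]
 #include "kvspool.h"
 #include <stdio.h>
 int main() {
   void *sp = kv_spoolwriter_new("spool");
   void *set = kv_set_new();
   char buf[32];
   for (int i = 0; i < 10; i++) {
     kv_set_clear(set);               /* empty the set so it can be reused */
     snprintf(buf, sizeof(buf), "%d", i);
     kv_adds(set, "sequence", buf);   /* key and value are copied into the set */
     kv_adds(set, "user", "Troy");
     kv_set_dump(set, stderr);        /* optional: inspect what will be written */
     kv_spool_write(sp, set);
   }
   kv_set_free(set);
   kv_spoolwriter_free(sp);
   return 0;
 }
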
To open a spool for reading, call `kv_spoolreader_new` which takes the spool directory and
returns an opaque handle to the spool. Then call `kv_spool_read` to read the spool.

[source,c]
 void *sp = kv_spoolreader_new("spool");
 void *set = kv_set_new();
 kv_spool_read(sp,set,1);

The final argument to `kv_spool_read` specifies whether it should block if there is no
data ready in the spool. A positive return value means success (data was read from the
spool and has been populated into the set). A zero value means that non-blocking mode
was used, but no data is currently available in the spool.

A C program can iterate through all the key-value pairs in the result set like this:

[source,c]
 kv_t *kv = NULL;
 while ( (kv = kv_next(set, kv))) {
   printf("key is %s\n", kv->key);
   printf("value is %s\n", kv->val);
 }

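Putting the reader calls together, a minimal sketch of a complete reader program might
look like the following. It uses only the functions shown above; clearing the set before
each read and the choice to skip freeing the reader handle are conservative assumptions,
not requirements stated by the API.

[source,c]
 #include "kvspool.h"
 #include <stdio.h>
 int main() {
   void *sp = kv_spoolreader_new("spool");
   void *set = kv_set_new();
   while (1) {
     kv_set_clear(set);                          /* start each frame with an empty set */
     if (kv_spool_read(sp, set, 1) <= 0) break;  /* 1 = block until data is available */
     kv_t *kv = NULL;
     while ( (kv = kv_next(set, kv))) {
       printf("%s: %s\n", kv->key, kv->val);
     }
   }
   kv_set_free(set);
   return 0;
 }
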
You can also fetch a particular key-value pair from the set using `kv_get`, providing
the key as the final argument:

 kv_t *kv = kv_get(set, "user");

The number of key-value pairs in the set can be obtained using `kv_len`:

 int count = kv_len(set);

The C API also has a function to rewind the spool, which works only if no reader has the
spool open at the time. It takes the spool directory as its only argument.

 sp_reset(dir);

// vim: set tw=90 wm=2 syntax=asciidoc: