kvspool/doc/index.txt

kvspool data stream tools
=========================
Troy D. Hanson <tdh@tkhanson.net>

Back to the https://github.com/troydhanson/kvspool[kvspool Github page].  Back to
http://troydhanson.github.io[my other software].

kv-spool ("key-value" spool)::
         a Linux-based C library to stream data between programs as key-value dictionaries.
         It has tools to pipe the streams around, such as by network pub/sub or tee, to
         concentrate streams together, and to monitor the consumption status of streams.

kvspool's niche
---------------
Kvspool is a tiny API to stream dictionaries between programs.  The dictionaries have
text keys and values. The dictionary is a set of key-value pairs.

To use kvspool, two programs open the same spool- which is just a directory. The writer
puts dictionaries into the spool.  The reader gets dictionaries from the spool. It blocks
when it's caught up, waiting for more data.  Like this,

image:reader-writer.png[A spool writer and reader]

.Why did I write kvspool?
*******************************************************************************
I wanted a very simple library that only writes to the local file system, so
applications can use kvspool without having to set anything up ahead of time-
no servers to run, no configuration files. I wanted no "side effects" to happen
in my programs-- no thread creation, no sockets, nothing going on underneath.
I wanted fewer rather than more features- taking the Unix pipe as a role model.
*******************************************************************************

.Loose coupling
Because the spooled data goes into the disk, the reader and writer are decoupled. They
don't have to run at the same time. They can come and go.  Also, if the reader exits and
restarts, it picks up where it left off.

Space management
~~~~~~~~~~~~~~~~
In a streaming data environment, the writer and reader can run for months on end.  A key
requirement is that we don't fill up the disk.  So, when we make a spool directory, we
tell kvspool what its maximum size should be:

  % mkdir spool
  % kvsp-init -s 1G spool

This configures a maximum of 1 GB of data in the spool. After it reaches that size, it
stays that size, by deleting old data as new data comes into the spool. Deletion is
done in units that are about 10% of the total spool size.

A reader that's offline long enough can eventually lose data- that is, miss the
opportunity to read it before it's deleted. Data loss is a deliberate feature - otherwise
it'd be necessary to block the writer or use unbounded disk space.

Shared memory I/O
~~~~~~~~~~~~~~~~~
You can locate a spool on a RAM disk if you want the speed of shared memory without true
disk persistence- kvspool comes with a `ramdisk` utility to make one.

Rewind and replay
~~~~~~~~~~~~~~~~~
After data has been read from the spool, it remains in the spool directory until a future
time when kvspool deletes it to make room for new data. This is the basis for a handy
"rewind and replay" feature:

  % kvsp-rewind spool

After a rewind, reading starts at the beginning of the spool. The reader should be
restarted for this to take effect. Rewind and replay is a convenient way to develop a
program on a consistent source of test data.

Snapshot
~~~~~~~~
A snapshot is just a copy of the spool. You can bring the copy back to a development
environment, rewind it, and use it as a consistent source of test data.

  % cp -r spool snapshot

Copying a spool of production data lets you develop programs on real data but without
needing a writer present- the data has been "canned".

Local and Network Replication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can also publish a spool over the network, like this:

  % kvsp-pub -d spool tcp://192.168.1.9:1110

Now, on the remote computers where you wish to subscribe to the spool, run:

  % kvsp-sub -d spool tcp://192.168.1.9:1110

If you give multiple addresses to `kvsp-sub`, it connects to all of them and concentrates
their published output into a single spool.

  % kvsp-sub -d spool tcp://192.168.1.9:1110 tcp://192.168.1.10:1111

Kvspool includes a few kinds of replication utilities described <<net_utilities,below>>.

.The big picture
*******************************************************************************
Before moving on- let's take a deep breath and recap. With kvspool, the writer
is unaware of whether network replication is taking place. The writer just writes
to the local spool. We run the `kvsp-pub` utility in the background; as data
comes into the spool, it transmits it on the network.

On the other computer- the receiving side- we run `kvsp-sub` in the background.
It receives the network transmissions, and writes them to its local spool.

Use `kvsp-pub` and `kvsp-sub` to maintain a live, continuous replication. As
data is written to the source spool, it just "shows up" in the remote spool.
The reader and writer are completely uninvolved in the replication process.
*******************************************************************************

image:pub-sub.png[Publish and Subscribe]

You can run `kvsp-sub` and `kvsp-pub` in the background at system boot by setting
them up as services or http://troydhanson.github.com/pmtr/[pmtr] jobs.

License
~~~~~~~
Kvspool uses the MIT license.

Getting kvspool
---------------
You can clone kvspool from github:

  % git clone git://github.com/troydhanson/kvspool.git

To build and install kvspool, you need autotools installed. The configure script looks
to see if optional libraries including ZeroMQ, Nanomsg, Jansson and librdkafka re
installed.  It builds additional utilities if they are present.

  % cd kvspool

  % ./autogen.sh
  % ./configure
  % make
  % sudo make install


Utilities
---------

Basic
~~~~~

[width="90%",cols="20m,50m",grid="none",options="header"]
|===============================================================================
|command     | example
|kvsp-init   | kvsp-init -s 1G spool
|kvsp-status | kvsp-status spool
|kvsp-rewind | kvsp-rewind spool
|kvsp-tee    | kvsp-tee -s spool copy1 copy2
|kvsp-concen | kvsp-concen -d spool1 -d spool2 spool
|kvsp-bcat   | kvsp-bcat -b config -d spool
|===============================================================================

The `kvsp-init` command is used when a spool directory is first created, to set
the maximum capacity of the spool. It accepts k/m/g/t suffixes. If `kvsp-init` is
run later, after the spool already exists and has data, it is resized.

Run `kvsp-status` to see what percentage of the spool has been consumed by a reader. It
can take multiple spools as arguments. For each spool it prints a line like:

  /tmp/spool            99%        877kb         28secs

The fields show what percentage of the spool has been read, how much data is currently in
the spool, and how recently data was written into the spool.

The `kvsp-rewind` command resets the reader position to the beginning (oldest frame) in the
spool. Use this command in order to "replay" the spooled data. Disconnect (terminate) any
readers before running this command.

Use `kvsp-tee` to support multiple readers from one input spool. First make a separate
spool directory for each reader (and use `kvsp-init` to set the capacity of each one);
then use `kvsp-tee` as the reader on the source spool. It maintains a continuous copy of
the spool to the multiple destination spools. This command needs to be left running to
maintain the tee. The `-W` flag can be used to run tee in the more efficient raw mode,
which copies the binary input to the binary output representation in the spool file.
It is possible to filter the incoming data using `-k key -r regex` options. In this
usage the tee only passes a dictionary if it has the key and its value matches regex.

The `kvsp-concen` utility is the opposite of `kvsp-tee`. It takes multiple source
spools and makes a single output spool from them. It is a spool concentrator. The
source spools are flagged with `-d spool` and the final argument is the output spool.

The `kvsp-bcat` command operates like `kvsp-bpub` (see below). It writes the binary
encoded spool content to standard output.

[[net_utilities]]
Network utilities
~~~~~~~~~~~~~~~~~

[width="90%",cols="10m,50m",grid="none",options="header"]
|===============================================================================
|command     | example
|kvsp-pub    | kvsp-pub -d spool tcp://192.168.1.9:1110
|kvsp-sub    | kvsp-sub -d spool tcp://192.168.1.9:1110
|kvsp-bpub   | kvsp-bpub -b cast.cfg -d spool tcp://192.168.1.9:2110
|kvsp-bsub   | kvsp-bsub -b cast.cfg -d spool tcp://192.168.1.9:2110
|kvsp-kkpub  | kvsp-kkpub -b kafka.host.name -t topic
|kvsp-tpub   | kvsp-tpub -b cast.cfg -d spool -p 2110
|===============================================================================

The network utilities keep a local spool continuously replicated to a remote spool.

kvsp-pub/kvsp-sub
^^^^^^^^^^^^^^^^^
The utilities `kvsp-pub` and `kvsp-sub` exist to publish a source spool to a remote spool.
This pair of utilities communicates in JSON over ZeroMQ.  The publisher listens on the
specified TCP port. The subscriber connects to it.  If more than one subscriber connects,
each receives a copy of the data. When no subscribers are connected, the publisher drops
data. However, when the `-s` flag is given to both `kvsp-pub` and `kvsp-sub`, two things
change: the publisher queues data when waiting for a subscriber instead of dropping it;
and secondly, if more than one subscriber connects, the data gets divided among them
rather than duplicated to all of them.

A subscriber can concentrate data (that is, "fan-in" the data) from many publishers,
simply by listing multiple ZeroMQ endpoints on the command line.

kvsp-bpub/kvsp-bsub
^^^^^^^^^^^^^^^^^^^
The `kvsp-bpub` and `kvsp-bsub` utilities implement binary-over-ZeroMQ replication. Their
usage is the same as `kvsp-pub` and `kvsp-sub` except the required `-b cast.cfg` argument.
This file lists the binary data type for each required key. Here's an example `cast.cfg`:

 i32 timestamp
 str sensor_name
 i8 temp

In this example, every dictionary in the source spool is expected to have three keys
(timestamp, sensor_name and temp); any other keys get ignored. Their values are parsed
to the binary types listed (i32, str and i8). The binary buffer is transmitted
without any keys. On the receiving subscriber, the binary is unpacked using the same file.
The reason for using binary publishing is speed- it's often an order of magnitude faster.
These data types may appear in `cast.cfg`

   i8        //  byte (8-bit int)
   i16       //  short (16-bit int)
   i32       //  integer (32-bit int)
   ipv4      //  dotted quad IP address
   str       //  string
   d64       //  double (64-bit float)

kvsp-kkpub
^^^^^^^^^^
Publishes the spool in JSON encoding to a Kakfa topic. This tool requires librdkakfa and
Jansson to be installed.

kvsp-tpub
^^^^^^^^^
Finally there is a "plain TCP" binary publisher. It has no subscriber counterpart yet, so
you have to code your own subscriber to use it. It takes a cast.cfg of the same form as above.
It listens on the specified port, and when a subscriber connects to it, the dictionaries
in the spool are transmitted as length-prefixed binary messages. The length prefix is a
32-bit  integer (host-endianness) specifying the message length that follows. The remaining
binary data is transmitted in host-endianness, except IP addresses in network order.

[[other_utilities]]
Other utilities
~~~~~~~~~~~~~~~

[width="90%",cols="10m,50m",grid="none",options="header"]
|===============================================================================
|command     | example
|kvsp-spr    | kvsp-spr -B 0 spool
|kvsp-spw    | kvsp-spw -i 10 spool
|kvsp-mod    | kvsp-mod -k key -o spool2 spool
|kvsp-speed  | kvsp-speed
|ramdisk     | ramdisk -c -s 1G /mnt/ramdisk
|===============================================================================

The `kvsp-spr` utility is used to manually read a spool and print its frames to the
screen. Normally it will block waiting for data once it reaches the end of the spool but
the `-B 0` (no-block) option tells it to stop reading if the end of the spool is reached.

The `kvsp-spw` utility is used only for testing; it writes a frame of data to the spool
(or several frames if the `-i` option is used with a count); in the latter mode there is a
sleep (delay) between each frame, which can be adjusted using the `-d <seconds>` option.

The `kvsp-mod` command "obfuscates" selected values from a source spool to hash values
in the output spool (named with the `-o` option). For each key named with the `-k` flag,
its value in the output spool is replaced with a mathematical hash.  The hash numbers
preserve consistency (so the same input value produces the same output value) but the
value itself is a meaningless number.

A simple benchmark is performed by the `kvsp-speed` utility to measure read and write
performance.

The `ramdisk` utility creates, queries or unmounts a ramdisk- a Linux tmpfs filesystem.
In the form shown in the table above it creates a 1G ramdisk on the `/mnt/ramdisk` mount
point (this directory must already exist). A ramdisk created this way will appear in the
`/proc/mounts` listing. If a ramdisk already exists on that mount point, the command does
nothing.  In create (`-c`) mode, the `-d <dir>` option may be used one or more times to
specify (as absolute paths) directories to create within the ramdisk.  Using `ramdisk -u
/mnt/ramdisk` unmounts it.  The `-q` option queries a directory to see if its a ramdisk
and show its size.  The `ramdisk` utility is included with kvspool because it is often
convenient to locate a spool on a ramdisk for performance.

API
---

C programs must be linked with -lkvspool.

[source,c]
  #include "kvspool.h"
  ...
  void *sp = kv_spoolwriter_new("spool");
  void *set = kv_set_new();
  kv_adds(set, "day", "Wednesday");
  kv_adds(set, "user", "Troy");
  kv_spool_write(sp,set);
  ...
  kv_set_free(set);
  kv_spoolwriter_free(sp);

In C, a spool is opened for writing by calling `kv_spoolwriter_new` which takes the spool
directory as argument. It returns an opaque handle (which you should eventually free with
`kv_spoolwriter_free`).

Kvspool provides a data structure that implements a dictionary in C. It is created using
`kv_set_new` which returns an opaque handle (which should eventually be freed using
`kv_set_free`). In the meantime, it can be used for the whole lifetime of the program, by
adding key-value pairs to it, writing them out, clearing it, and reusing it over and over.
To add a key-value pair, use `kv_adds` which takes the set handle, the key and the value.
The key and value get 'copied' into the set (the set does not keep a pointer to them).
To write the set into the spool, call `kv_spool_write` with the spool and set handle.
To re-use the set, call `kv_set_clear` with the set handle. For debugging you can dump a
set to stderr using `kv_set_dump(set,stderr)`.

To open a spool for reading, call `kv_spoolreader_new` which takes the spool directory and
returns an opaque handle to the spool.  Then call `kv_spool_read` to read the spool.

[source,c]
  void *sp = kv_spoolreader_new("spool");
  void *set = kv_set_new();
  kv_spool_read(sp,set,1);

The final argument to `kv_spool_read` specifies whether it should block if there is no
data ready in the spool. A positive return value means success (data was read from the
spool and it's been populated into the set).  A zero value means that non-blocking mode
was used, but no data is currently available in the spool.

A C program can iterate through all the key-value pairs in the result set like this:

[source,c]
  kv_t *kv = NULL;
  while ( (kv = kv_next(set, kv))) {
    printf("key is %s\n", kv->key);
    printf("value is %s\n", kv->val);
  }

You can also fetch a particular key-value pair from the set using `kv_get`, providing
the key as the final argument:

 kv_t *kv = kv_get(set, "user");

The number of key-value pairs in the set can be obtained using `kv_len`:

 int count = kv_len(set);

The C API also has a function to rewind the spool, which works if there is no reader that
has the spool open at the time. It takes the spool directory as its only argument.

 sp_reset(dir);

// vim: set tw=90 wm=2 syntax=asciidoc: