This new script attempts to generate subreddits that look more like
production data. It first pulls down data from reddit.com, then uses
Markov chains to generate new data for insertion into the databases.
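For reference, a minimal sketch of word-level Markov chain generation
in Python; `build_chain` and `generate` are illustrative names, not the
script's actual API:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, length=20):
    """Random-walk the chain to produce new, plausible-looking text."""
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        out.append(word)
    return " ".join(out)
```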
* configuration now comes from the command line, making it easier to use
for multiple projects.
* the bucket is now an S3 URL, allowing a path prefix to be added to
files (see the sketch after this list).
* authentication now comes from boto's credential system, which allows
us to use IAM roles.
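A minimal sketch of the combined effect, using boto3 for illustration
(the credential chain resolves the same way: environment variables,
shared credentials file, then IAM role); the `upload` helper is
hypothetical:

```python
from urllib.parse import urlparse
import boto3

def upload(s3_url, filename, body):
    """Upload `body` under the bucket and prefix encoded in an s3:// URL."""
    parsed = urlparse(s3_url)            # e.g. s3://my-bucket/some/prefix
    bucket = parsed.netloc
    prefix = parsed.path.lstrip("/")
    key = "/".join(filter(None, [prefix, filename]))
    # no explicit credentials: boto resolves them itself, which is what
    # lets an IAM role attached to the instance work transparently
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```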
This takes our current config payload from 4700 bytes down to 1700. The
goal is to reduce ZooKeeper network load during config changes and
during app restarts on deploys.
This adds two redirects, `event_click` and `event_redirect`:
`event_click` appends a user ID to an event before redirecting, if we
require one, and `event_redirect` services a local evented redirect,
similar to ad clicks. `event_click` is necessary for tracking clicks on
embeds, which are served via redditmedia and are therefore always
anonymous. When a user clicks through, we want to know who they were
and then redirect them on their way. Because we're currently using
nginx to store events as an access log, this means two redirects: one
to append the session ID and another to record the event with the
proper session ID.
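A rough sketch of the two-hop flow, with hypothetical route names and
parameters; `redirect` stands in for the framework's HTTP 302 helper:

```python
from urllib.parse import urlencode

def redirect(location):
    # placeholder for the real 302 response helper
    return ("302 Found", {"Location": location})

def event_click(session_id, event_params, dest_url):
    """First hop: we know who the user is here, so stamp the event
    with their session ID and bounce to the logged redirect."""
    params = dict(event_params, session=session_id, url=dest_url)
    return redirect("/event_redirect?" + urlencode(params))

def event_redirect(query_params):
    """Second hop: nginx writes this request (session ID included) to
    the event access log; we just send the user to their destination."""
    return redirect(query_params["url"])
```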
Thanks to Nathanael A. Hoyle for the report! Some of these unchecked
allocations may have been exploitable due to pointer arithmetic before
reads/writes. Just bail out if we can't allocate.
Double-checking clicks both in the click app and in the processing
scripts was difficult. Just trust the click app and assume any request
that got a 302 response is valid.
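The processing side then reduces to a status-code filter, sketched here
under the assumption of a combined-format nginx access log (field
positions are illustrative):

```python
def valid_clicks(log_lines):
    """Keep only requests the click app answered with a 302; anything
    else was already rejected there, so no second validation pass."""
    for line in log_lines:
        fields = line.split()
        # in combined log format the status code follows the quoted
        # request line, landing at field 8 here
        if len(fields) > 8 and fields[8] == "302":
            yield line
```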
Some advertisers set their ad's URL to an intermediate tracker so
they can independently track clicks. This results in a chain of
redirects like this:
reddit tracker > intermediate tracker > final destination
The ad's URL is communicated to the reddit tracker through a query
parameter, which is urlencoded on reddit.com and then unquoted when
handled by the reddit tracker. This unquoting causes problems when the
intermediate tracker has its own query string that needs to stay
urlencoded. This commit adds handling for those query strings.
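A toy illustration of how an extra decode bares the inner query string
(for instance, if the framework has already decoded the parameter
once); the URLs are made up and this is one plausible failure mode, not
the tracker's exact code path:

```python
from urllib.parse import quote, unquote

# the ad's URL: an intermediate tracker whose own query string carries
# an encoded final destination
ad_url = "http://tracker.example/c?next=http%3A%2F%2Ffinal.example%2F%3Fa%3D1%26b%3D2"

# reddit.com urlencodes the whole thing into the tracker's query parameter
param = quote(ad_url, safe="")

print(unquote(param) == ad_url)  # True: one decode round-trips cleanly
print(unquote(unquote(param)))   # the inner query string is now bare:
# http://tracker.example/c?next=http://final.example/?a=1&b=2
```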
The queries are used to find the IDs of PromoCampaign or Link objects;
we don't need the many PromotionWeights objects (one per campaign per
subreddit target per day).
Ok, now I'm getting some angst in my commit messages like my
predecessors had. I understand now. It's a terrible burden. Why must
the calendar progress? Why must numbers increment? The world is
forever turning.
The future is here.
It is 2014.
Previously, the subreddit/domain and account precomputers were separate.
This merges the two and improves their portability in the process.
Because of the increased portability, the precomputer can now be added
to the install script by default.
This attribute can serve as a handy indicator that a user is a moderator
somewhere and can therefore replace the more costly modship lookup in
reddit_base.
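A minimal sketch of the pattern with hypothetical names
(`is_moderator_somewhere`, `_commit`): maintain a boolean on the
account when modship changes, then read it cheaply per request:

```python
def on_moderator_added(user):
    user.is_moderator_somewhere = True   # assumed attribute name
    user._commit()                       # hypothetical persistence call

def on_moderator_removed(user, remaining_modships):
    if remaining_modships == 0:
        user.is_moderator_somewhere = False
        user._commit()

def needs_mod_ui(user):
    # replaces fetching every subreddit the user moderates, which
    # reddit_base previously did on each request
    return getattr(user, "is_moderator_somewhere", False)
```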
This is intended to reduce the number of critical secrets stored in the
INI file. An initial subset of secrets is moved into the vault to test
things out.
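Assuming the vault here is HashiCorp Vault, purely for illustration, a
minimal sketch via the `hvac` client; the path and key are made up:

```python
import hvac

client = hvac.Client(url="https://vault.example:8200")
# before: the secret sat in the INI file alongside ordinary settings;
# now: fetch it at startup from the vault instead
secret = client.read("secret/app/example_key")["data"]["value"]
```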