145 Commits

Author SHA1 Message Date
Brian Simpson
1d0dda280a compute_time_listings: Report exceptions to Sentry 2016-10-19 16:34:22 -07:00
Brian Simpson
fd17945edf Delete log_q 2016-10-04 13:45:11 -07:00
David Wick
faac2af02f tracker.py: Ensure /click endpoint doesn't modify destination urls
Omitting `keep_blank_values` was dropping blank query parameters.
Furthermore, converting the output of `parse_qsl` to a dictionary
was unnecessarily modifying the order of parameters since dicts
are not ordered. Fortunately `urllib.urlencode` also accepts a
sequence of two-element tuples and the order of parameters in
the encoded string will match the order of parameter tuples in the
sequence.
2016-09-13 15:34:41 -07:00
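A minimal Python 3 sketch of the fix described above (the live code uses Python 2's `urlparse`/`urllib`; the function name here is illustrative):

```python
from urllib.parse import parse_qsl, urlencode

def reencode_query(query):
    # keep_blank_values=True preserves parameters like "utm_source="
    # that parse_qsl would otherwise silently drop.
    params = parse_qsl(query, keep_blank_values=True)
    # urlencode also accepts a sequence of (key, value) tuples, so the
    # original parameter order survives the round trip; converting to a
    # dict would not guarantee that.
    return urlencode(params)

print(reencode_query("utm_source=&b=2&a=1"))  # utm_source=&b=2&a=1
```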
Joshua Uziel
805a9a4b79 Onboarding: Featured subreddits API endpoint
Endpoint for retrieving subreddits to make new
users aware of during onboarding, without
actually subscribing them.

The list of featured subreddits will be managed
separately from the default ones.

For testing purposes, we add a subreddit to show
up via the inject_test_data script.
2016-06-20 16:19:37 -07:00
Brian Simpson
a4bf06235c TryLater: move all processing and cleanup into a class method
Previously this was split between the class and the script handler
2016-06-15 11:18:59 -07:00
Brian Simpson
4292df773f TryLater: Use xget to process ALL ready items 2016-06-15 11:18:59 -07:00
Brian Simpson
d4cb4087b5 run_trylater: Flush the stats 2016-06-15 11:18:59 -07:00
Greg Taylor
31b83a3350 Incorporating PEP8 checking on PRs. 2016-06-15 11:10:21 -07:00
Jack Niu
734433664d Support PM delete operation on recipient side only 2016-05-18 14:53:25 -07:00
zeantsoi
0e45ccefe0 Implement ads auction
Completes an iteration of our transition to an auction model. Non-sponsor
users can only create auction campaigns; sponsors can create either auction
or legacy fixed CPM campaigns.
2016-05-04 10:43:20 -07:00
David King
bdab877b2e Fix tuples_to_sstables for Cassandra 1.2
* The constructor for `SSTableSimpleUnsortedWriter` changed
* `sstableloader` changed its directory structure. Now we just don't manage it at all in `tuples_to_sstables` and make the caller do the work
2016-03-03 14:43:06 -08:00
David King
c8f10bb7b8 Parallelise parts of mr_top jobs
**Explanation**: compute_time_listings is slow. Really slow. At a quick glance, here are the jobs running right now:

    date: Sun Jan 17 20:04:56 PST 2016
    -rw-rw-r-- 1 ri ri 1.2G Jan 17 12:37 comment-week-data.dump
    -rw-rw-r-- 1 ri ri 683M Jan 17 12:25 comment-week-thing.dump
    -rw-rw-r-- 1 ri ri  53G Jan 16 07:13 comment-year-data.dump
    -rw-rw-r-- 1 ri ri  31G Jan 16 04:37 comment-year-thing.dump
    -rw-rw-r-- 1 ri ri 276M Jan 17 17:04 link-week-data.dump
    -rw-rw-r-- 1 ri ri  70M Jan 17 17:03 link-week-thing.dump

So the currently running top-comments-by-year listing has been running for nearly 37 hours and isn't done. top-comments-by-week has been running for 8 hours. top-links-by-week has been running for 3 hours. And this is just me checking on currently running jobs, not actual completion times.

The slow bit is the actual writing to Cassandra in `write_permacache`. This is mostly because `write_permacache` is extremely naive and blocks waiting for individual writes with no batching or parallelisation. There are a lot of ways to work around this and some of them will become easier when we're no longer writing out to the permacache at all, but until then (and even after that) this approach lets us keep doing the simple-to-understand thing while parallelising some of the work.

**The approach**: `compute_time_listings` is written as a mapreduce job in our `mr_tools` toolkit, with `write_permacache` as the final reducer. In `mr_tools`, you can run multiple reducers as long as a given reducer is guaranteed to receive all of the lines for the same key. So this patch adds `hashdist.py`, a tool that runs multiple copies of a target job and distributes lines to them from stdin, using each line's first tab-delimited field to meet this promise. (The same script could apply to mappers and sorts too, but in my tests for this job the gains were minimal because `write_permacache` is still the bottleneck up to a large number of reducers.)

**Numbers**: A top-links-by-hour listing in prod right now takes 1m46.387s to run. This patch reduces that to 0m43.960s using 2 jobs (a 60% savings). The top-links-by-week job that I previously killed after 3 hours completed in 56m47.329s. The top-links-by-year job that I killed last week at over 36 hours finished in 19 hours.

**Downsides**: It costs some additional RAM: roughly 10 MB for hashdist.py and 100 MB of memory for each additional copy of the job. It multiplies the effective load on Cassandra by the number of jobs (although I have no reason to believe that it's practical to overload Cassandra this way right now; I've tested up to 5 jobs).

**Further work**: With this we could easily do sort|reducer fusion to significantly reduce the work required by the sorter. `hashdist.py` as written is pretty slow and is only acceptable because `write_permacache` is even slower; a non-Python implementation would be straightforward and much faster.
2016-02-18 15:35:58 -08:00
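`hashdist.py` itself isn't shown here, but the routing idea can be sketched roughly as follows (function and variable names are illustrative):

```python
import subprocess

def route(line, num_jobs):
    # Lines sharing a first tab-delimited field always hash to the same
    # worker index, preserving the all-lines-for-one-key guarantee that
    # mr_tools reducers rely on.
    key = line.split("\t", 1)[0]
    return hash(key) % num_jobs

def hash_distribute(lines, cmd, num_jobs):
    # Run num_jobs copies of the target job and fan lines out by key.
    workers = [subprocess.Popen(cmd, stdin=subprocess.PIPE)
               for _ in range(num_jobs)]
    for line in lines:
        workers[route(line, num_jobs)].stdin.write(line.encode())
    for w in workers:
        w.stdin.close()
        w.wait()
```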
David King
f63050fb14 Make the mr_top reducer run sort(1) with LC_ALL=C
By default, Linux's sort(1) uses locale-based sorting. Normally this is what
humans want, but for mapreduce it breaks the guarantee that the same reducer
always sees each instance of the same key. Here's an example:

    user/comment/top/week/1102922   26  1453516098.92   t1_cz8jgq9
    user/comment/top/week/1102922   3   1453527927.97   t1_cz8ovzj
    user/comment/top/week/11029224  1   1453662674.45   t1_cza98tb
    user/comment/top/week/1102922   4   1453515976.97   t1_cz8jee8
    user/comment/top/week/1102922   4   1453519790.67   t1_cz8lavb
    user/comment/top/week/11029224  2   1453827188.31   t1_czcotf1
    user/comment/top/week/1102922   7   1453521946.74   t1_cz8mb50
    user/comment/top/week/1102922   7   1453524230.93   t1_cz8ncj2
    user/comment/top/week/1102922   7   1453527760.32   t1_cz8otkx
    user/comment/top/week/1102922   7   1453528700.96   t1_cz8p6u3
    user/comment/top/week/11029228  1   1453285875.44   t1_cz525gu
    user/comment/top/week/11029228  1   1453292202.65   t1_cz53ulm
    user/comment/top/week/11029228  1   1453292232.55   t1_cz53uxe

According to sort(1) using the default locale, this is already sorted.
Unfortunately, that means that to a reducer this list represents 6 different
listings (each of which will overwrite the previous runs of the same listing).
But that's not what we want. It's actually three listings, like:

    user/comment/top/week/1102922   26  1453516098.92   t1_cz8jgq9
    user/comment/top/week/1102922   3   1453527927.97   t1_cz8ovzj
    user/comment/top/week/1102922   4   1453515976.97   t1_cz8jee8
    user/comment/top/week/1102922   4   1453519790.67   t1_cz8lavb
    user/comment/top/week/1102922   7   1453521946.74   t1_cz8mb50
    user/comment/top/week/1102922   7   1453524230.93   t1_cz8ncj2
    user/comment/top/week/1102922   7   1453527760.32   t1_cz8otkx
    user/comment/top/week/1102922   7   1453528700.96   t1_cz8p6u3
    user/comment/top/week/11029224  1   1453662674.45   t1_cza98tb
    user/comment/top/week/11029224  2   1453827188.31   t1_czcotf1
    user/comment/top/week/11029228  1   1453285875.44   t1_cz525gu
    user/comment/top/week/11029228  1   1453292202.65   t1_cz53ulm
    user/comment/top/week/11029228  1   1453292232.55   t1_cz53uxe

To do this, we need to set the environment variable LC_ALL=C when running sort(1)
to indicate that the sorting should operate only on the raw bytes.

It looks like this has been broken since the Trusty Tahr upgrade.
2016-02-18 15:35:02 -08:00
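The fix amounts to overriding the locale when spawning sort(1); a sketch in Python (the function name is illustrative):

```python
import os
import subprocess

def bytewise_sort(data):
    # LC_ALL=C makes sort(1) compare raw bytes instead of applying
    # locale collation, so keys like "...922\t" and "...9224\t" can no
    # longer interleave and split one listing across reducers.
    env = dict(os.environ, LC_ALL="C")
    return subprocess.run(["sort"], input=data, env=env,
                          capture_output=True, check=True).stdout
```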
David King
6e71efa726 Fix a bug in compute_time_listings that would allow simultaneous runs
The cause of the bug is that if we fail to start because someone else has already started, we still delete their files.

From job-02 right now:

    write_permacache() # comment ("day", "week+
    write_permacache() # link ("day","week")
    write_permacache() # link ("day","week")
    write_permacache() # comment ("day", "week+
    write_permacache() # link ("month","year")
    write_permacache() # comment ("month", "ye+
2016-02-18 15:33:01 -08:00
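The shape of the fix, sketched with a hypothetical flock-based guard (names invented for illustration): cleanup must happen only after the lock is actually acquired.

```python
import fcntl
import os

def run_exclusively(lockfile, cleanup, work):
    # If another run already holds the lock, bail out WITHOUT touching
    # its dump files; only the lock holder may clean up and proceed.
    fd = os.open(lockfile, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return False
    try:
        cleanup()
        work()
        return True
    finally:
        os.close(fd)  # closing the descriptor also releases the flock
```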
David King
c5f26d235b Speed up mr_top permacache writes by about 30%
* Make the writes lockless, because mr_top is the only thing that ever
  writes to these keys. This avoids a lot of memcached round trips.
* Don't set the listings in permacache_memcaches; delete from it instead.
  This keeps us from blowing out that whole cache, every time mr_top runs,
  with listings that will never be read.
* Fix a performance bug in _mr_tools.mr_reduce_max_per_key
2016-02-18 15:32:56 -08:00
Brian Simpson
051a934e7c inject_test_data: Set LocalizedDefaultSubreddits. 2015-12-02 13:32:42 -08:00
Ricky Ramirez
b4e0849eed geoip: Update paths to databases.
geoipupdate 1.6.6 now drops files into /var/lib/GeoIP
2015-11-19 16:03:01 -08:00
David Wick
db34bcaf40 Ads: Do not unescape click urls 2015-11-16 15:49:59 -08:00
Brian Simpson
91e1221d83 inject_test_data: Make sure promos subreddit exists. 2015-11-12 15:32:12 -08:00
Chad Birch
0c637151b5 Major rewrite/refactor of voting code
This is a rewrite of much of the code related to voting. Some of the
improvements include:

- Detangling the whole process of creating and processing votes
- Creating an actual Vote class to use instead of dealing with
  inconsistent ad-hoc voting data everywhere
- More consistency with naming and other similar things like vote
  directions (previously had True/False/None in some places,
  1/-1/0 in others, etc.)
- More flexible methods in determining and applying the effects of votes
- Improvement of modhash generation/validation
- Removing various obsolete/unnecessary code
2015-11-09 17:57:42 -07:00
Chad Birch
8ce2db7386 Liveconfig diff: use pprint to generate dict reprs
The write_live_config script often gives really strange results when
trying to display diffs of changes to dict values, since the ordering of
a dict is not defined. So key/value pairs will sometimes be rearranged
between the old and new versions, creating a confusing diff.

This changes to use the pprint module to generate string representations
for dicts, because it sorts the dict by key before outputting it, so we
should get consistent representations that can be compared more easily.
2015-10-14 12:46:53 -06:00
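A sketch of the idea (the helper name is illustrative): pprint sorts dict keys, so equal dicts render identically and the diff shows only real changes.

```python
import difflib
import pprint

def config_diff(old, new):
    # pformat sorts dict keys before printing, giving a stable text
    # representation to feed into the differ.
    old_lines = pprint.pformat(old).splitlines()
    new_lines = pprint.pformat(new).splitlines()
    return "\n".join(difflib.unified_diff(old_lines, new_lines,
                                          "old", "new", lineterm=""))
```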
Brian Simpson
d990533d0b Upgrade pylons from 0.9.7 to 1.0.
http://pylons-webframework.readthedocs.org/en/latest/upgrading.html

This requires several code changes:
* pylons `config` option must be explicitly passed during setup
* the pylons global has been renamed from `g` to `app_globals`
* the pylons global has been renamed from `c` to `tmpl_context`
* set pylons.strict_tmpl_context = False (instead of pylons.strict_c)
* redirect_to() has been swapped for redirect()
* must implement `ErrorDocuments` middleware ourselves

pylons 1.0 also required an upgrade of routes from 1.11 to 1.12. This
required the following changes:
* set Mapper.minimization = True (the default value changed)
* set Mapper.explicit = False (the default value changed)
2015-09-15 06:35:31 -04:00
Alexander Putilin
6889236c8c Fix inject_test_data.py after changes in Link._submit 2015-09-02 12:28:59 -07:00
Chad Birch
df68d45d68 write_live_config: check similarity when diffing
Turns out that SequenceMatcher is not quite as neat as it appears. When
diffing a string change like:

 old: "this is the old string value"
 new: "on"

It was displayed as:

 - "this is the "
 - "ld stri"
 - "g value"

because only the "o" and "n" were kept as a match. This change displays it
as a wholesale key removal/addition unless the strings are at least 50%
similar.
2015-08-20 17:32:51 -06:00
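The 50% gate can be expressed with `difflib.SequenceMatcher.ratio()` (the helper name is illustrative):

```python
from difflib import SequenceMatcher

def similar_enough(old, new, threshold=0.5):
    # Below the threshold, a character-level diff is mostly noise, so
    # the caller should report a wholesale removal/addition instead.
    return SequenceMatcher(None, old, new).ratio() >= threshold

print(similar_enough("this is the old string value", "on"))  # False
```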
Chad Birch
650b7f7266 write_live_config: Show only changed settings
The script would previously dump out the entire new parsed live
configuration, which (now that we have a lot of fields in it) made it
difficult to find the ones that had actually changed. This fetches the
existing live config, then compares it to the new one and only outputs
any data that has changed for confirmation.
2015-08-20 14:05:06 -06:00
Dustin Pho
e3431123ab Fix magic number for SHA1 hash length 2015-07-30 04:03:19 -04:00
Matthew94
3469d62291 geoip_service: use dict comprehensions for clarity. 2015-07-30 03:44:11 -04:00
Brian Simpson
d846f0609d Update SubscribedSubredditsByAccount backfill.
There's no index on SRMember.c.rel_id, so instead the query is sorted
by SRMember.c.thing2_id (the user's id). The timestamp for writing to
C* is also updated to an integer timestamp corresponding to when the
dual write was deployed.
2015-06-30 18:41:09 -04:00
Brian Simpson
fb8ca7fcbb Dual write SubscribedSubredditsByAccount and add backfill script. 2015-06-30 18:41:08 -04:00
xiongchiamiov
0702bc4886 Image previews: fix old urls
We've changed the url structure of image previews a number of times, which
breaks everything uploaded prior to the latest version.  This script should
find all preview images that have been uploaded thus far, move them to the
appropriate place, and save an updated and correct storage url in every Link
that uses them.
2015-05-26 11:47:56 -07:00
Jordan Milne
644f0988d7 Add a shim for requests' response.json to ease upgrading to 2.x
See http://docs.python-requests.org/en/latest/api/#migrating-to-1-x
for the rationale.

`.json()` also differs from `.json` in that it `raise`s instead of
returning `None` on a decoding error, but that shouldn't affect us
anywhere.

Conflicts:
	r2/r2/lib/media.py
2015-05-15 13:28:44 -07:00
Brian Simpson
91699b7acd Add backfill script for CommentScoresByLink. 2015-05-12 08:44:11 -04:00
umbrae
03bc77b203 Beta mode: Add preference and subreddit callouts 2015-05-07 10:57:47 -07:00
MelissaCole
f05603ad98 Add backfill script for num_gildings
This script will update Account.num_gildings for all gildings in the
gold table (rows where trans_id is like 'X%').
2015-05-06 11:10:39 -07:00
xiongchiamiov
6ef290a1cf Userpage: fix top listings for comments
If you go to a userpage and sort by top (in either the overview or comments
tabs), and restrict the time range to anything other than "all time", no
comments will be shown.

The data in these listings is built from functions in `lib/db/queries.py`
(specifically from `get_comments()` down).  This ends up trying to pull the
query results from permacache (in `CachedResults.fetch_multi()`), defaulting to
an empty list if no cache entry is found.

Now, the cache entry is supposed to be populated periodically by a cronjob that
calls `scripts/compute_time_listings`.  This script (and its Python helpers in
`lib/mr_top.py` and `lib/mr_tools/`) generates a dump of data from PostgreSQL,
then reads through that and builds up entries to insert into the cache.  As
with many scripts of this sort, it expects to get in some bad data, and so
performs some basic sanity checks.

The problem is that the sanity checks have been throwing out all comments.
With no new comments, there's nothing new to put into the cache!

The root of this was a refactoring in reddit/reddit@3511b08 that combined
several different scripts that were doing similar things.  Unfortunately, we
ended up requiring the `url` field on comments, which doesn't exist because,
well, comments aren't links.

Now we have two sets of fields that we expect to get, one for comments and one
for links, and all is good.

We also now have a one-line summary of processed/skipped entries printed out,
which will help to make a problem like this more obvious in the future.
2015-04-30 15:53:33 -07:00
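A hypothetical sketch of the two-field-set check plus the summary line (field names are invented for illustration; the real sets live in `lib/mr_top.py`):

```python
def filter_rows(rows, required_fields):
    # Keep only rows carrying every required field; comments and links
    # now get separate field sets, so comments are no longer rejected
    # for lacking a link-only field like "url".
    processed = skipped = 0
    kept = []
    for row in rows:
        if required_fields <= set(row):
            processed += 1
            kept.append(row)
        else:
            skipped += 1
    print("processed=%d skipped=%d" % (processed, skipped))
    return kept

LINK_FIELDS = {"id", "score", "url"}       # illustrative
COMMENT_FIELDS = {"id", "score", "body"}   # illustrative
```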
zeantsoi
7f1aa2a520 Command line script to add subreddit to collection 2015-04-15 17:04:46 -07:00
Chad Birch
0fbea80d45 Integrate AutoModerator into the site 2015-03-31 14:56:19 -06:00
xiongchiamiov
fb507e59a6 TryLater: use more generic parameter name
`mature_items` made sense in the original context, but now that I'm stealing it
for other uses, it's really just some set of data.
2015-03-03 15:50:04 -08:00
Jordan Milne
28a913c242 Add backfill script for deleted user accounts 2015-03-03 14:26:22 -08:00
Neil Williams
274a8d7008 Overhaul populatedb script.
This new script attempts to generate some subreddits that are more like
production data.  It first pulls down data from reddit.com, then uses
markov chains to generate new data for insertion into the databases.
2015-03-02 14:44:57 -08:00
Neil Williams
1605b48eda upload_static_files_to_s3: Improve logging clarity.
The goal is to make it seem less like the build is "hanging" during the
upload step.
2015-03-02 10:41:07 -08:00
Neil Williams
30e2256fdd upload_static_files_to_s3: Remove vestigial support for gzipped statics.
This feature was never really used and the core support for it was torn
out in 68857e1a7d.
2015-03-02 10:41:07 -08:00
Neil Williams
64a469e64e upload_static_files_to_s3: Rework to use command line arguments.
* configuration now comes from the command line so it's easier to use
  for multiple projects.
* the bucket is now an s3 url allowing a path prefix to be added to
  files.
* authentication now comes from boto's credential system which allows
  us to use IAM roles.
2015-03-02 10:41:07 -08:00
Neil Williams
4090acd8d8 Remove unused static file cleaner.
It never really worked right and is getting in the way now.
2015-03-02 10:41:07 -08:00
Neil Williams
f501106d9c zookeeper: gzip live config payload.
This takes our current config payload from 4700 bytes to 1700. The goal
is to reduce zookeeper network load during config changes as well as app
restarts during deploys.
2015-03-02 10:41:07 -08:00
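A sketch of the round trip (using zlib here for brevity; the function names and JSON serialization are assumptions, not the commit's actual format):

```python
import json
import zlib

def pack_live_config(config):
    # Compress the serialized payload before writing it to the
    # ZooKeeper node; the commit reports ~4700 bytes shrinking to ~1700.
    return zlib.compress(json.dumps(config).encode("utf-8"))

def unpack_live_config(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```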
Neil Williams
f00e90f0ed zookeeper: Remove obsolete LiveDict class.
This is no longer used since the relevant consumers have switched to
Cassandra as a backing store.
2015-03-02 10:41:07 -08:00
umbrae
0850bd3044 Tracker: URL decode session cookie 2015-02-20 23:58:51 -08:00
umbrae
ea5aa9c538 Tracker: add domain prefix to redirect domain 2015-02-20 21:49:46 -08:00
umbrae
76fb41a3f8 Tracker: add session tracking redirector
This adds two redirects, `event_click` and `event_redirect`: `event_click`
allows appending a user ID to an event before redirecting, if we require one,
and `event_redirect` services a local evented redirect, similar to ad clicks.

`event_click` is necessary for tracking clicks from users on embeds, which are
served via redditmedia, and therefore are always anonymous. When a user clicks
through, we want to know who they were and redirect them on their way. Because
of the way we're using nginx to store events as an access log right now, this
means we'll need to use two redirects: one to append the session ID and
another to store the event with the proper session ID.
2015-02-20 21:23:09 -08:00
John-William Trenholm
9e64e56960 tracker.py: use env var for configuration file
Upstart needs to use an environment variable to determine the
configuration file.
2015-02-12 16:00:04 -08:00