reddit

mirror of https://github.com/reddit-archive/reddit.git synced 2026-04-27 03:00:12 -04:00

Author	SHA1	Message	Date
Brian Simpson	1d0dda280a	compute_time_listings: Report exceptions to Sentry	2016-10-19 16:34:22 -07:00
David King	c8f10bb7b8	Parallelise parts of mr_top jobs
Explanation: compute_time_listings is slow. Really slow. At a quick glance, here are the jobs running right now:
date: Sun Jan 17 20:04:56 PST 2016
-rw-rw-r-- 1 ri ri 1.2G Jan 17 12:37 comment-week-data.dump
-rw-rw-r-- 1 ri ri 683M Jan 17 12:25 comment-week-thing.dump
-rw-rw-r-- 1 ri ri 53G Jan 16 07:13 comment-year-data.dump
-rw-rw-r-- 1 ri ri 31G Jan 16 04:37 comment-year-thing.dump
-rw-rw-r-- 1 ri ri 276M Jan 17 17:04 link-week-data.dump
-rw-rw-r-- 1 ri ri 70M Jan 17 17:03 link-week-thing.dump
So the currently running top-comments-by-year listing has been running for nearly 37 hours and isn't done. top-comments-by-week has been running for 8 hours. top-links-by-week has been running for 3 hours. And this is just me checking on currently running jobs, not actual completion times.
The slow bit is the actual writing to Cassandra in `write_permacache`. This is mostly because `write_permacache` is extremely naive and blocks waiting for individual writes with no batching or parallelisation. There are a lot of ways to work around this, and some of them will become easier when we're no longer writing out to the permacache at all, but until then (and even after that) this approach lets us keep doing the simple-to-understand thing while parallelising some of the work.
The approach: `compute_time_listings` is written as a mapreduce job in our `mr_tools` toolkit, with `write_permacache` as the final reducer. In `mr_tools`, you can run multiple reducers as long as a given reducer is guaranteed to receive all of the records for a given key. So this patch adds `hashdist.py`, a tool that runs multiple copies of a target job and distributes lines to them from stdin, using their first tab-delimited field to meet this promise. (The same script could apply to mappers and sorts too, but in my tests for this job the gains were minimal because `write_permacache` is still the bottleneck up to a large number of reducers.)
Numbers: A top-links-by-hour listing in prod right now takes 1m46.387s to run. This patch reduces that to 0m43.960s using 2 jobs (roughly a 60% saving). The top-links-by-week job that I previously killed after 3 hours completed in 56m47.329s. The top-links-by-year job that I killed last week at over 36 hours finished in 19 hours.
Downsides: It costs some additional RAM: roughly 10 MB for `hashdist.py` and 100 MB in memory for each additional copy of the job. It multiplies the effective load on Cassandra by the number of jobs (although I have no reason to believe that it's practical to overload Cassandra this way right now; I've tested up to 5 jobs).
Further work: with this we could easily do sort|reducer fusion to significantly reduce the work required by the sorter. `hashdist.py` as written is pretty slow and is only acceptable because `write_permacache` is even slower; a non-Python implementation would be straightforward and way faster.	2016-02-18 15:35:58 -08:00
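The key-partitioning idea described for `hashdist.py` can be sketched roughly as follows. This is not the script from the commit: the function names `bucket_for`, `distribute` and `run` are hypothetical, the choice of CRC32 as the hash is an assumption, and the code is written for Python 3 purely for illustration. It only shows how hashing the first tab-delimited field keeps every line for a given key on the same worker.

```python
import subprocess
import zlib


def bucket_for(key: bytes, n_workers: int) -> int:
    """Pick a worker index for a key.

    crc32 is used (an assumption) because it is stable across processes,
    unlike Python's built-in hash() for bytes/str.
    """
    return zlib.crc32(key) % n_workers


def distribute(lines, n_workers):
    """Partition tab-delimited lines so that all lines sharing a first
    field land in the same bucket."""
    buckets = [[] for _ in range(n_workers)]
    for line in lines:
        key = line.split(b"\t", 1)[0]
        buckets[bucket_for(key, n_workers)].append(line)
    return buckets


def run(cmd, n_workers, stream):
    """Start n_workers copies of cmd and feed each one its share of the
    lines on stdin. Workers inherit this process's stdout."""
    workers = [subprocess.Popen(cmd, stdin=subprocess.PIPE)
               for _ in range(n_workers)]
    for line in stream:
        key = line.split(b"\t", 1)[0]
        workers[bucket_for(key, n_workers)].stdin.write(line)
    for worker in workers:
        worker.stdin.close()
    return [worker.wait() for worker in workers]
```

Something like `run(["./reducer"], 2, sys.stdin.buffer)` would then stand in for a single reducer at the end of a pipeline, where `./reducer` is a placeholder for the real job. Because a key always hashes to the same index, each reducer still sees every record for the keys it owns.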
David King	f63050fb14	Make the mr_top reducer run sort(1) with LC_ALL=C
By default, Linux's sort(1) uses locale-based sorting. Normally this is what humans want, but for mapreduce it breaks the guarantee that the same reducer always sees each instance of the same key. Here's an example:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
According to sort(1) using the default locale, this is already sorted. Unfortunately, that means that to a reducer this list represents 6 different listings (each of which will overwrite the previous runs of the same listing). But that's not what we want.
It's actually three listings, like:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
To do this, we need to set the environment variable LC_ALL=C when running sort(1) to indicate that the sorting should operate only on the raw bytes. It looks like this has been broken since the Trusty Tahr upgrade.	2016-02-18 15:35:02 -08:00
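A small sketch of why the ordering matters to a streaming reducer, using the keys from the example in this commit message. The helper names `count_runs` and `c_locale_sort` are hypothetical and do not come from `mr_tools`. A reducer that treats every change of key as the start of a new listing sees one listing per consecutive run of equal keys.

```python
import os
import subprocess
from itertools import groupby

PREFIX = "user/comment/top/week/"

# Key order as emitted by the locale-aware sort in the example above.
locale_order = [PREFIX + k for k in (
    ["1102922"] * 2 + ["11029224"] + ["1102922"] * 2 + ["11029224"]
    + ["1102922"] * 4 + ["11029228"] * 3
)]


def count_runs(keys):
    """Number of consecutive runs of equal keys, i.e. how many separate
    listings a streaming reducer would think it has seen."""
    return sum(1 for _ in groupby(keys))


def c_locale_sort(data: bytes) -> bytes:
    """Run sort(1) with LC_ALL=C so lines are compared as raw bytes.
    Assumes a POSIX sort binary is on PATH."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(["sort"], input=data, env=env,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout


print(count_runs(locale_order))                              # 6
print(count_runs(sorted(k.encode() for k in locale_order)))  # 3
```

Sorting the encoded keys in Python compares raw bytes, which is the ordering `LC_ALL=C` asks sort(1) to use, so equal keys become adjacent.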
David King	6e71efa726	Fix a bug in compute_time_listings that would allow simultaneous runs
The cause of the bug is that if we fail to start because someone else has already started, we still delete their files. From job-02 right now:
write_permacache() # comment ("day", "week+
write_permacache() # link ("day","week")
write_permacache() # link ("day","week")
write_permacache() # comment ("day", "week+
write_permacache() # link ("month","year")
write_permacache() # comment ("month", "ye+	2016-02-18 15:33:01 -08:00
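The class of bug described here can be illustrated with a sketch in which cleanup only happens when this process actually acquired the lock. The source does not show how `compute_time_listings` implements its locking, so the lock-file mechanism and the name `run_exclusively` below are assumptions for illustration only.

```python
import os


def run_exclusively(lock_path, work_files, job):
    """Run job() only if we can create lock_path ourselves.

    Returns False without touching any files when another run holds the
    lock; the buggy pattern is to clean up work_files on that path too.
    """
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else is running: leave their files alone
    os.close(fd)
    try:
        job()
    finally:
        # Only the lock holder removes the working files and the lock.
        for path in work_files:
            if os.path.exists(path):
                os.remove(path)
        os.remove(lock_path)
    return True
```

The important detail is that the early return sits before any cleanup code, so a run that fails to start cannot delete the dump files that the running job is still reading.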
David King	c5f26d235b	Speed up mr_top permacache writes by about 30%
* Make them lockless, because mr_top is the only one that ever writes to it. This avoids a lot of memcached round trips.
* Don't set them to permacache_memcaches. Delete from it instead. This keeps us from blowing out that whole cache, every time mr_top runs, with listings that will never be read out of it.
* Fix a performance bug in _mr_tools.mr_reduce_max_per_key	2016-02-18 15:32:56 -08:00
xiongchiamiov	6ef290a1cf	Userpage: fix top listings for comments
If you go to a userpage and sort by top (in either the overview or comments tabs), and restrict the time range to anything other than "all time", no comments will be shown.
The data in these listings is built from functions in `lib/db/queries.py` (specifically from `get_comments()` down). This ends up trying to pull the query results from permacache (in `CachedResults.fetch_multi()`), defaulting to an empty list if no cache entry is found.
Now, the cache entry is supposed to be populated periodically by a cronjob that calls `scripts/compute_time_listings`. This script (and its Python helpers in `lib/mr_top.py` and `lib/mr_tools/`) generates a dump of data from PostgreSQL, then reads through that and builds up entries to insert into the cache. As with many scripts of this sort, it expects to receive some bad data, and so performs some basic sanity checks.
The problem is that the sanity checks have been throwing out all comments. With no new comments, there's nothing new to put into the cache!
The root of this was a refactoring in reddit/reddit@3511b08 that combined several different scripts that were doing similar things. Unfortunately, we ended up requiring the `url` field on comments, which doesn't exist because, well, comments aren't links.
Now we have two sets of fields that we expect to get, one for comments and one for links, and all is good. We also now have a one-line summary of processed/skipped entries printed out, which will help to make a problem like this more obvious in the future.	2015-04-30 15:53:33 -07:00
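The fix described above, one set of expected fields per thing type plus a processed/skipped summary, might look roughly like the sketch below. Only the `url` field is named in the commit message. Every other field name, the `REQUIRED_FIELDS` table and the `filter_things` function are hypothetical and are not taken from `lib/mr_top.py`.

```python
import sys

# Hypothetical field sets; the source only confirms that links have a
# `url` field and comments do not.
REQUIRED_FIELDS = {
    "link": {"url", "sr_id", "author_id"},
    "comment": {"sr_id", "author_id"},
}


def filter_things(thing_type, rows):
    """Keep rows that carry every field required for their type and
    report how many were processed and skipped."""
    required = REQUIRED_FIELDS[thing_type]
    kept, skipped = [], 0
    for row in rows:
        if required.issubset(row):
            kept.append(row)
        else:
            skipped += 1
    # One-line summary, so a check that rejects everything is obvious.
    sys.stderr.write("%s: processed %d, skipped %d\n"
                     % (thing_type, len(kept), skipped))
    return kept, skipped
```

With a single shared field set that included `url`, every comment row would fail the check and the summary would show zero processed, which is the symptom this commit addresses.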
Neil Williams	af09fa8dee	Update license headers to 2015. The highlight of each year for me.	2015-01-08 13:35:03 -08:00
umbrae	c0bff7498b	Support 'all' in compute_time_listings	2014-07-17 13:03:13 -07:00
Chad Birch	7b24dacd77	compute_time_listings MINID query: order by date	2014-02-26 11:44:08 -08:00
Neil Williams	3511b08110	Combine and generalize the time listing precomputer scripts. Previously, the subreddit/domain and account precomputers were separate. This merges the two and improves their portability in the process. Because of the increased portability, the precomputer can now be added to the install script by default.	2014-02-13 13:50:52 -08:00

10 Commits