mirror of https://github.com/reddit-archive/reddit.git (synced 2026-04-27 03:00:12 -04:00)
**Explanation**: `compute_time_listings` is slow. Really slow. At a quick glance, here are the jobs running right now:
date: Sun Jan 17 20:04:56 PST 2016
-rw-rw-r-- 1 ri ri 1.2G Jan 17 12:37 comment-week-data.dump
-rw-rw-r-- 1 ri ri 683M Jan 17 12:25 comment-week-thing.dump
-rw-rw-r-- 1 ri ri 53G Jan 16 07:13 comment-year-data.dump
-rw-rw-r-- 1 ri ri 31G Jan 16 04:37 comment-year-thing.dump
-rw-rw-r-- 1 ri ri 276M Jan 17 17:04 link-week-data.dump
-rw-rw-r-- 1 ri ri 70M Jan 17 17:03 link-week-thing.dump
So the currently running top-comments-by-year listing has been running for nearly 37 hours and isn't done. top-comments-by-week has been running for 8 hours. top-links-by-week has been running for 3 hours. And this is just me checking on currently running jobs, not actual completion times.
The slow bit is the actual writing to Cassandra in `write_permacache`. This is mostly because `write_permacache` is extremely naive: it blocks waiting on each individual write, with no batching or parallelisation. There are a lot of ways to work around this, and some of them will become easier when we're no longer writing out to the permacache at all, but until then (and even after that) this approach lets us keep doing the simple-to-understand thing while parallelising some of the work.
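To illustrate the contrast (this is a hedged sketch, not the actual permacache code), here is what batched, concurrent writes look like in general; `write_batch` stands in for whatever hypothetical bulk-write call the store offers:

```python
from concurrent.futures import ThreadPoolExecutor

def batches(items, size):
    """Group items into lists of at most `size` elements for bulk writes."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]

def write_parallel(write_batch, items, batch_size=100, workers=4):
    """Issue batched writes from a thread pool instead of blocking on one
    write per item; `write_batch` is a stand-in for the store's bulk call."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # materialize the map so all writes complete before returning
        list(pool.map(write_batch, batches(items, batch_size)))
```

The patch below takes a different, simpler route (running whole copies of the job in parallel), but the bottleneck it attacks is the same one.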
**The approach**: `compute_time_listings` is written as a mapreduce job in our `mr_tools` toolkit, with `write_permacache` as the final reducer. In `mr_tools`, you can run multiple copies of a reducer as long as each copy is guaranteed to receive all of the lines for a given key. So this patch adds `hashdist.py`, a tool that runs multiple copies of a target job and distributes lines to them from stdin, hashing each line's first tab-delimited field to meet this guarantee. (The same script could apply to mappers and sorts too, but in my tests for this job the gains were minimal because `write_permacache` remains the bottleneck up to a large number of reducers.)
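The routing idea can be sketched in a few lines of Python (a simplified illustration, not the actual `hashdist.py`): a stable hash of each line's first tab-delimited field picks the subprocess, so every key always lands on the same copy of the job:

```python
import subprocess
import zlib

def job_for(key, njobs):
    """A stable hash of the key picks which job receives this line."""
    return zlib.crc32(key.encode()) % njobs

def hashdist(cmd, njobs, lines):
    """Run njobs copies of cmd and route each tab-delimited line of input
    to one of them by its first field, keeping a key's lines together."""
    procs = [subprocess.Popen(cmd, shell=True, stdin=subprocess.PIPE)
             for _ in range(njobs)]
    for line in lines:
        key = line.split("\t", 1)[0]
        procs[job_for(key, njobs)].stdin.write(line.encode())
    for p in procs:
        p.stdin.close()
        p.wait()
```

Because the hash is deterministic, a reducer in any one copy sees either all of a key's lines or none of them, which is exactly the promise `mr_tools` needs.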
**Numbers**: A top-links-by-hour listing in prod right now takes 1m46.387s to run. This patch reduces that to 0m43.960s using 2 jobs (about a 60% savings). The top-links-by-week job that I previously killed after 3 hours completed in 56m47.329s. The top-links-by-year job that I killed last week at over 36 hours finished in 19 hours.
**Downsides**: It costs some additional RAM: roughly 10MB for `hashdist.py` and 100MB for each additional copy of the job. It also multiplies the effective load on Cassandra by the number of jobs (although I have no reason to believe it's practical to overload Cassandra this way right now; I've tested up to 5 jobs).
**Further work**: With this we could easily do sort|reducer fusion to significantly reduce the work required by the sorter. `hashdist.py` as written is pretty slow and is only acceptable because `write_permacache` is even slower; a non-Python implementation would be straightforward and much faster.
141 lines · 4.2 KiB · Bash · Executable File
#!/bin/bash
# The contents of this file are subject to the Common Public Attribution
# License Version 1.0. (the "License"); you may not use this file except in
# compliance with the License. You may obtain a copy of the License at
# http://code.reddit.com/LICENSE. The License is based on the Mozilla Public
# License Version 1.1, but Sections 14 and 15 have been added to cover use of
# software over a computer network and provide for limited attribution for the
# Original Developer. In addition, Exhibit A has been modified to be consistent
# with Exhibit B.
#
# Software distributed under the License is distributed on an "AS IS" basis,
# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for
# the specific language governing rights and limitations under the License.
#
# The Original Code is reddit.
#
# The Original Developer is the Initial Developer. The Initial Developer of
# the Original Code is reddit Inc.
#
# All portions of the code written by reddit are Copyright (c) 2006-2015 reddit
# Inc. All Rights Reserved.
###############################################################################

set -e

# expects two environment variables
# REDDIT_ROOT = path to the root of the reddit public code; the directory with the Makefile
# REDDIT_INI = path to the ini file to use
# which should be supplied via:
source /etc/default/reddit

# additionally, some configuration can be overridden in the environment
export TMPDIR=${TMPDIR:-/tmp}
export PGUSER=${PGUSER:-reddit}
export PGHOST=${PGHOST:-localhost}

## command line args
# one of "link" or "comment"
export THING_CLS="$1"
# period of data to extract from postgres: e.g. "hour", "week", "year", "all"
export INTERVAL="$2"
# which period listings to update.
# formatted as a python tuple of strings: e.g. '("hour",)' or '("week", "all",)' etc.
export TIMES="$3"

echo "Starting $THING_CLS processing"

THING_DUMP=$TMPDIR/$THING_CLS-$INTERVAL-thing.dump
DATA_DUMP=$TMPDIR/$THING_CLS-$INTERVAL-data.dump
function clean_up {
    rm -f $THING_DUMP $DATA_DUMP
}

if [ -e $THING_DUMP ]; then
    echo cannot start because $THING_DUMP exists
    ls -l $THING_DUMP
    exit 1
fi
touch $THING_DUMP

# since we're in charge of this run now, we have to clean up afterwards
trap clean_up EXIT

function run_query {
    psql -F"\t" -A -t -c "$1"
}

function mrsort {
    LC_ALL=C sort -S200m
}

function reddit {
    reddit_usage() {
        echo "reddit: [-jN] cmd..." >&2
        exit
    }

    local OPTIND o njobs

    njobs=1

    while getopts ":j:" o; do
        case "${o}" in
            j)
                njobs="${OPTARG}"
                ;;
            *)
                reddit_usage
                ;;
        esac
    done
    shift $((OPTIND-1))

    cmd="paster --plugin=r2 run $REDDIT_INI $REDDIT_ROOT/r2/lib/mr_top.py -c \"$@ # $THING_CLS $INTERVAL $TIMES\""

    if [ "$njobs" = "1" ]; then
        sh -c "$cmd" # just execute it directly
    else
        $REDDIT_ROOT/../scripts/hashdist.py -n"$njobs" -- sh -c "$cmd"
    fi
}

# Hack to let pg fetch all things with intervals
if [ "$INTERVAL" = "all" ]; then
    export INTERVAL="century"
fi

MINID=$(run_query "SELECT thing_id
                   FROM reddit_thing_$THING_CLS
                   WHERE
                       date > now() - interval '1 $INTERVAL' AND
                       date < now()
                   ORDER BY date
                   LIMIT 1")
if [ -z "$MINID" ]; then
    echo \$MINID is empty. Replication is likely behind.
    exit 1
fi

run_query "\\copy (SELECT thing_id, 'thing', '$THING_CLS', ups, downs, deleted, spam, extract(epoch from date)
           FROM reddit_thing_$THING_CLS
           WHERE
               not deleted AND
               thing_id >= $MINID
           ) to $THING_DUMP"

run_query "\\copy (SELECT thing_id, 'data', '$THING_CLS', key, value
           FROM reddit_data_$THING_CLS
           WHERE
               key IN ('url', 'sr_id', 'author_id') AND
               thing_id >= $MINID
           ) to $DATA_DUMP"

cat $THING_DUMP $DATA_DUMP |
    mrsort |
    reddit "join_things('$THING_CLS')" |
    reddit "time_listings($TIMES, '$THING_CLS')" |
    mrsort |
    reddit -j4 "write_permacache()"

echo 'Done.'