reddit/install/reddit.sh
David King c8f10bb7b8 Parallelise parts of mr_top jobs
**Explanation**: compute_time_listings is slow. Really slow. At a quick glance, here are the jobs running right now:

    date: Sun Jan 17 20:04:56 PST 2016
    -rw-rw-r-- 1 ri ri 1.2G Jan 17 12:37 comment-week-data.dump
    -rw-rw-r-- 1 ri ri 683M Jan 17 12:25 comment-week-thing.dump
    -rw-rw-r-- 1 ri ri  53G Jan 16 07:13 comment-year-data.dump
    -rw-rw-r-- 1 ri ri  31G Jan 16 04:37 comment-year-thing.dump
    -rw-rw-r-- 1 ri ri 276M Jan 17 17:04 link-week-data.dump
    -rw-rw-r-- 1 ri ri  70M Jan 17 17:03 link-week-thing.dump

So the currently running top-comments-by-year listing has been running for nearly 37 hours and isn't done. top-comments-by-week has been running for 8 hours. top-links-by-week has been running for 3 hours. And this is just me checking on currently running jobs, not actual completion times.

The slow bit is the actual writing to Cassandra in `write_permacache`. This is mostly because `write_permacache` is extremely naive and blocks waiting for individual writes with no batching or parallelisation. There are a lot of ways to work around this, and some of them will become easier when we're no longer writing out to the permacache at all, but until then (and even after that) this approach lets us keep doing the simple-to-understand thing while parallelising some of the work.

**The approach**: `compute_time_listings` is written as a mapreduce job in our `mr_tools` toolkit, with `write_permacache` as the final reducer. In `mr_tools`, you can run multiple reducers as long as a given reducer is guaranteed to receive all of the records for the same key. So this patch adds `hashdist.py`, a tool that runs multiple copies of a target job and distributes lines to them from stdin using their first tab-delimited field to meet this promise. (The same script could apply to mappers and sorts too, but in my tests for this job the gains were minimal because `write_permacache` is still the bottleneck up to a large number of reducers.)
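
The distribution step described above can be sketched roughly as follows. This is an illustration only, not the actual `hashdist.py`: the names `bucket_for` and `distribute` are hypothetical, and CRC32 is an assumed choice of hash (any hash that is stable across processes would do).

```python
import subprocess
import sys
import zlib


def bucket_for(line, n_jobs):
    """Pick a worker index from the first tab-delimited field of a line."""
    key = line.rstrip(b"\n").split(b"\t", 1)[0]
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(key) % n_jobs


def distribute(lines, command, n_jobs):
    """Run n_jobs copies of command and route each line to one by its key.

    Every line with the same first field lands on the same worker's stdin,
    which is the guarantee a reducer needs.
    """
    workers = [
        subprocess.Popen(command, stdin=subprocess.PIPE)
        for _ in range(n_jobs)
    ]
    for line in lines:
        workers[bucket_for(line, n_jobs)].stdin.write(line)
    for worker in workers:
        worker.stdin.close()
    # Collect the exit status of every copy of the job.
    return [worker.wait() for worker in workers]
```

A caller would pass a binary line iterator such as `sys.stdin.buffer` together with the reducer's argv; each worker inherits stdout, so the copies' output is interleaved rather than merged in key order.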

**Numbers**: A top-links-by-hour listing in prod right now takes 1m46.387s to run. This patch reduces that to 0m43.960s using 2 jobs (a saving of roughly 59%). The top-links-by-week job that I previously killed after 3 hours completed in 56m47.329s. The top-links-by-year job that I killed last week at over 36 hours finished in 19 hours.
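
As a quick check of the hourly figures, the two timings can be converted to seconds and compared. The helper names `parse_time` and `savings` are made up for this sketch; only the two timing strings come from the measurements above.

```python
def parse_time(text):
    """Convert a shell `time`-style string such as '1m46.387s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def savings(before, after):
    """Fraction of wall-clock time saved going from before to after."""
    return 1 - after / before


before = parse_time("1m46.387s")  # single job
after = parse_time("0m43.960s")   # 2 jobs via hashdist.py

# prints "58.7% less wall time, 2.42x speedup"
print(f"{savings(before, after):.1%} less wall time, "
      f"{before / after:.2f}x speedup")
```

A speedup above 2x with only 2 jobs fits the claim that the time is dominated by blocking on individual writes rather than by CPU.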

**Downsides**: It costs some additional RAM: roughly 10 MB for `hashdist.py` and 100 MB for each additional copy of the job. It multiplies the effective load on Cassandra by the number of jobs (although I have no reason to believe that it's practical to overload Cassandra this way right now; I've tested up to 5 jobs).
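
For sizing, the rough memory figures above can be turned into a back-of-the-envelope estimate. `extra_ram_mb` is a hypothetical helper, and the defaults are just the approximate numbers quoted in this message, not measured constants.

```python
def extra_ram_mb(n_jobs, hashdist_mb=10, per_job_mb=100):
    """Estimate RAM added on top of a plain single-job run, in MB.

    Assumes one hashdist.py process plus (n_jobs - 1) additional copies
    of the job, using the rough per-process figures quoted above.
    """
    if n_jobs < 2:
        raise ValueError("the estimate only applies to 2 or more jobs")
    return hashdist_mb + per_job_mb * (n_jobs - 1)


# The 2-job run from the timings and the largest tested configuration.
for jobs in (2, 5):
    print(jobs, "jobs:", extra_ram_mb(jobs), "MB extra")
```

Under these assumptions, 2 jobs cost about 110 MB extra and 5 jobs about 410 MB.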

**Further work**: With this we could easily do sort|reducer fusion to significantly reduce the work required by the sorter. `hashdist.py` as written is pretty slow and is only acceptable because `write_permacache` is even slower; a non-Python implementation would be straightforward and way faster.
2016-02-18 15:35:58 -08:00

#!/bin/bash
###############################################################################
# reddit dev environment installer
# --------------------------------
# This script installs a reddit stack suitable for development. DO NOT run this
# on a system that you use for other purposes as it might delete important
# files, truncate your databases, and otherwise do mean things to you.
#
# By default, this script will install the reddit code in the current user's
# home directory and all of its dependencies (including libraries and database
# servers) at the system level. The installed reddit will expect to be visited
# on the domain "reddit.local" unless specified otherwise. Configuring name
# resolution for the domain is expected to be done outside the installed
# environment (e.g. in your host machine's /etc/hosts file) and is not
# something this script handles.
#
# Several configuration options (listed in the "Configuration" section below)
# are overridable with environment variables. e.g.
#
# sudo REDDIT_DOMAIN=example.com ./install/reddit.sh
#
###############################################################################
# load configuration
RUNDIR=$(dirname $0)
source $RUNDIR/install.cfg
###############################################################################
# Sanity Checks
###############################################################################
if [[ $EUID -ne 0 ]]; then
    echo "ERROR: Must be run with root privileges."
    exit 1
fi
if [[ -z "$REDDIT_USER" ]]; then
    # in a production install, you'd want the code to be owned by root and run
    # by a less privileged user. this script is intended to build a development
    # install, so we expect the owner to run the app and not be root.
    cat <<END
ERROR: You have not specified a user. This usually means you're running this
script directly as root. It is not recommended to run reddit as the root user.
Please create a user to run reddit and set the REDDIT_USER variable
appropriately.
END
    exit 1
fi
if [[ "amd64" != $(dpkg --print-architecture) ]]; then
    cat <<END
ERROR: This host is running the $(dpkg --print-architecture) architecture!
Because of the pre-built dependencies in our PPA, and some extra picky things
like ID generation in liveupdate, installing reddit is only supported on amd64
architectures.
END
    exit 1
fi
# seriously! these checks are here for a reason. the packages from the
# reddit ppa aren't built for anything but trusty (14.04) right now, so
# if you try and use this install script on another release you're gonna
# have a bad time.
source /etc/lsb-release
if [ "$DISTRIB_ID" != "Ubuntu" -o "$DISTRIB_RELEASE" != "14.04" ]; then
    echo "ERROR: Only Ubuntu 14.04 is supported."
    exit 1
fi
if [[ "2000000" -gt $(awk '/MemTotal/{print $2}' /proc/meminfo) ]]; then
    LOW_MEM_PROMPT="reddit requires at least 2GB of memory to work properly, continue anyway? [y/n] "
    read -er -n1 -p "$LOW_MEM_PROMPT" response
    if [[ "$response" != "y" ]]; then
        echo "Quitting."
        exit 1
    fi
fi
###############################################################################
# Install prerequisites
###############################################################################
# install primary packages
$RUNDIR/install_apt.sh
# install cassandra from datastax
$RUNDIR/install_cassandra.sh
# install services (rabbitmq, postgres, memcached, etc.)
$RUNDIR/install_services.sh
###############################################################################
# Install the reddit source repositories
###############################################################################
if [ ! -d $REDDIT_SRC ]; then
    mkdir -p $REDDIT_SRC
    chown $REDDIT_USER $REDDIT_SRC
fi
function clone_reddit_repo {
    local destination=$REDDIT_SRC/${1}
    local repository_url=https://github.com/${2}.git
    if [ ! -d $destination ]; then
        sudo -u $REDDIT_USER -H git clone $repository_url $destination
    fi
    if [ -d $destination/upstart ]; then
        cp $destination/upstart/* /etc/init/
    fi
}
function clone_reddit_plugin_repo {
    clone_reddit_repo $1 reddit/reddit-plugin-$1
}
clone_reddit_repo reddit reddit/reddit
clone_reddit_repo i18n reddit/reddit-i18n
for plugin in $REDDIT_PLUGINS; do
    clone_reddit_plugin_repo $plugin
done
###############################################################################
# Configure Services
###############################################################################
# Configure Cassandra
$RUNDIR/setup_cassandra.sh
# Configure PostgreSQL
$RUNDIR/setup_postgres.sh
# Configure mcrouter
$RUNDIR/setup_mcrouter.sh
# Configure RabbitMQ
$RUNDIR/setup_rabbitmq.sh
###############################################################################
# Install and configure the reddit code
###############################################################################
function install_reddit_repo {
    cd $REDDIT_SRC/$1
    sudo -u $REDDIT_USER python setup.py build
    python setup.py develop --no-deps
}
install_reddit_repo reddit/r2
install_reddit_repo i18n
for plugin in $REDDIT_PLUGINS; do
    install_reddit_repo $plugin
done
# generate binary translation files from source
cd $REDDIT_SRC/i18n/
sudo -u $REDDIT_USER make clean all
# this builds static files and should be run *after* languages are installed
# so that the proper language-specific static files can be generated and after
# plugins are installed so all the static files are available.
cd $REDDIT_SRC/reddit/r2
sudo -u $REDDIT_USER make clean all
plugin_str=$(echo -n "$REDDIT_PLUGINS" | tr " " ,)
if [ ! -f development.update ]; then
    cat > development.update <<DEVELOPMENT
# after editing this file, run "make ini" to
# generate a new development.ini
[DEFAULT]
# global debug flag -- displays pylons stacktrace rather than 500 page on error when true
# WARNING: a pylons stacktrace allows remote code execution. Make sure this is false
# if your server is publicly accessible.
debug = true
disable_ads = true
disable_captcha = true
disable_ratelimit = true
disable_require_admin_otp = true
page_cache_time = 0
domain = $REDDIT_DOMAIN
oauth_domain = $REDDIT_DOMAIN
plugins = $plugin_str
media_provider = filesystem
media_fs_root = /srv/www/media
media_fs_base_url_http = http://%(domain)s/media/
[server:main]
port = 8001
DEVELOPMENT
    chown $REDDIT_USER development.update
else
    sed -i "s/^plugins = .*$/plugins = $plugin_str/" $REDDIT_SRC/reddit/r2/development.update
    sed -i "s/^domain = .*$/domain = $REDDIT_DOMAIN/" $REDDIT_SRC/reddit/r2/development.update
    sed -i "s/^oauth_domain = .*$/oauth_domain = $REDDIT_DOMAIN/" $REDDIT_SRC/reddit/r2/development.update
fi
sudo -u $REDDIT_USER make ini
if [ ! -L run.ini ]; then
    sudo -u $REDDIT_USER ln -nsf development.ini run.ini
fi
###############################################################################
# some useful helper scripts
###############################################################################
function helper-script() {
    cat > $1
    chmod 755 $1
}
helper-script /usr/local/bin/reddit-run <<REDDITRUN
#!/bin/bash
exec paster --plugin=r2 run $REDDIT_SRC/reddit/r2/run.ini "\$@"
REDDITRUN
helper-script /usr/local/bin/reddit-shell <<REDDITSHELL
#!/bin/bash
exec paster --plugin=r2 shell $REDDIT_SRC/reddit/r2/run.ini
REDDITSHELL
helper-script /usr/local/bin/reddit-start <<REDDITSTART
#!/bin/bash
initctl emit reddit-start
REDDITSTART
helper-script /usr/local/bin/reddit-stop <<REDDITSTOP
#!/bin/bash
initctl emit reddit-stop
REDDITSTOP
helper-script /usr/local/bin/reddit-restart <<REDDITRESTART
#!/bin/bash
initctl emit reddit-restart TARGET=\${1:-all}
REDDITRESTART
helper-script /usr/local/bin/reddit-flush <<REDDITFLUSH
#!/bin/bash
echo flush_all | nc localhost 11211
REDDITFLUSH
helper-script /usr/local/bin/reddit-serve <<REDDITSERVE
#!/bin/bash
exec paster serve --reload $REDDIT_SRC/reddit/r2/run.ini
REDDITSERVE
###############################################################################
# pixel and click server
###############################################################################
mkdir -p /var/opt/reddit/
chown $REDDIT_USER:$REDDIT_GROUP /var/opt/reddit/
mkdir -p /srv/www/pixel
chown $REDDIT_USER:$REDDIT_GROUP /srv/www/pixel
cp $REDDIT_SRC/reddit/r2/r2/public/static/pixel.png /srv/www/pixel
if [ ! -f /etc/gunicorn.d/click.conf ]; then
    cat > /etc/gunicorn.d/click.conf <<CLICK
CONFIG = {
    "mode": "wsgi",
    "working_dir": "$REDDIT_SRC/reddit/scripts",
    "user": "$REDDIT_USER",
    "group": "$REDDIT_USER",
    "args": (
        "--bind=unix:/var/opt/reddit/click.sock",
        "--workers=1",
        "tracker:application",
    ),
}
CLICK
fi
service gunicorn start
###############################################################################
# nginx
###############################################################################
mkdir -p /srv/www/media
chown $REDDIT_USER:$REDDIT_GROUP /srv/www/media
cat > /etc/nginx/sites-available/reddit-media <<MEDIA
server {
listen 9000;
expires max;
location /media/ {
alias /srv/www/media/;
}
}
MEDIA
cat > /etc/nginx/sites-available/reddit-pixel <<PIXEL
upstream click_server {
server unix:/var/opt/reddit/click.sock fail_timeout=0;
}
server {
listen 8082;
log_format directlog '\$remote_addr - \$remote_user [\$time_local] '
'"\$request_method \$request_uri \$server_protocol" \$status \$body_bytes_sent '
'"\$http_referer" "\$http_user_agent"';
access_log /var/log/nginx/traffic/traffic.log directlog;
location / {
rewrite ^/pixel/of_ /pixel.png;
add_header Last-Modified "";
add_header Pragma "no-cache";
expires -1;
root /srv/www/pixel/;
}
location /click {
proxy_pass http://click_server;
}
}
PIXEL
cat > /etc/nginx/sites-available/reddit-ssl <<SSL
map \$http_upgrade \$connection_upgrade {
default upgrade;
'' close;
}
server {
listen 443;
ssl on;
ssl_certificate /etc/ssl/certs/ssl-cert-snakeoil.pem;
ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers EECDH+AES128:RSA+AES128:EECDH+AES256:RSA+AES256:EECDH+3DES:RSA+3DES:!MD5;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:1m;
location / {
proxy_pass http://127.0.0.1:8080;
proxy_set_header Host \$http_host;
proxy_http_version 1.1;
proxy_set_header X-Forwarded-For \$remote_addr;
proxy_pass_header Server;
# allow websockets through if desired
proxy_set_header Upgrade \$http_upgrade;
proxy_set_header Connection \$connection_upgrade;
}
}
SSL
# remove the default nginx site that may conflict with haproxy
rm -rf /etc/nginx/sites-enabled/default
# put our config in place
ln -nsf /etc/nginx/sites-available/reddit-media /etc/nginx/sites-enabled/
ln -nsf /etc/nginx/sites-available/reddit-pixel /etc/nginx/sites-enabled/
ln -nsf /etc/nginx/sites-available/reddit-ssl /etc/nginx/sites-enabled/
# make the pixel log directory
mkdir -p /var/log/nginx/traffic
# link the ini file for the Flask click tracker
ln -nsf $REDDIT_SRC/reddit/r2/development.ini $REDDIT_SRC/reddit/scripts/production.ini
service nginx restart
###############################################################################
# haproxy
###############################################################################
if [ -e /etc/haproxy/haproxy.cfg ]; then
    BACKUP_HAPROXY=$(mktemp /etc/haproxy/haproxy.cfg.XXX)
    echo "Backing up /etc/haproxy/haproxy.cfg to $BACKUP_HAPROXY"
    cat /etc/haproxy/haproxy.cfg > $BACKUP_HAPROXY
fi
# make sure haproxy is enabled
cat > /etc/default/haproxy <<DEFAULT
ENABLED=1
DEFAULT
# configure haproxy
cat > /etc/haproxy/haproxy.cfg <<HAPROXY
global
maxconn 350
frontend frontend
mode http
bind 0.0.0.0:80
bind 127.0.0.1:8080
timeout client 24h
option forwardfor except 127.0.0.1
option httpclose
# make sure that requests have x-forwarded-proto: https iff tls
reqidel ^X-Forwarded-Proto:.*
acl is-ssl dst_port 8080
reqadd X-Forwarded-Proto:\ https if is-ssl
# send websockets to sutro
acl is-websocket hdr(Upgrade) -i WebSocket
use_backend sutro if is-websocket
# send media stuff to the local nginx
acl is-media path_beg /media/
use_backend media if is-media
# send pixel stuff to local nginx
acl is-pixel path_beg /pixel/
acl is-click path_beg /click
use_backend pixel if is-pixel || is-click
default_backend reddit
backend reddit
mode http
timeout connect 4000
timeout server 30000
timeout queue 60000
balance roundrobin
server app01-8001 localhost:8001 maxconn 30
backend sutro
mode http
timeout connect 4s
timeout server 24h
balance roundrobin
server sutro localhost:8002 maxconn 250
backend media
mode http
timeout connect 4000
timeout server 30000
timeout queue 60000
balance roundrobin
server nginx localhost:9000 maxconn 20
backend pixel
mode http
timeout connect 4000
timeout server 30000
timeout queue 60000
balance roundrobin
server nginx localhost:8082 maxconn 20
HAPROXY
# this will start it even if currently stopped
service haproxy restart
###############################################################################
# sutro (websocket server)
###############################################################################
if [ ! -f /etc/sutro.ini ]; then
cat > /etc/sutro.ini <<SUTRO
[app:main]
paste.app_factory = sutro.app:make_app
amqp.host = localhost
amqp.port = 5672
amqp.vhost = /
amqp.username = reddit
amqp.password = reddit
web.allowed_origins = $REDDIT_DOMAIN
web.mac_secret = YWJjZGVmZ2hpamtsbW5vcHFyc3R1dnd4eXowMTIzNDU2Nzg5
web.ping_interval = 300
stats.host =
stats.port = 0
[server:main]
use = egg:gunicorn#main
worker_class = sutro.socketserver.SutroWorker
workers = 1
worker_connections = 250
host = 127.0.0.1
port = 8002
graceful_timeout = 5
forward_allow_ips = 127.0.0.1
[loggers]
keys = root
[handlers]
keys = syslog
[formatters]
keys = generic
[logger_root]
level = INFO
handlers = syslog
[handler_syslog]
class = handlers.SysLogHandler
args = ("/dev/log", "local7")
formatter = generic
level = NOTSET
[formatter_generic]
format = [%(name)s] %(message)s
SUTRO
fi
if [ ! -f /etc/init/sutro.conf ]; then
cat > /etc/init/sutro.conf << UPSTART_SUTRO
description "sutro websocket server"
stop on runlevel [!2345]
start on runlevel [2345]
respawn
respawn limit 10 5
kill timeout 15
limit nofile 65535 65535
exec gunicorn_paster /etc/sutro.ini
UPSTART_SUTRO
fi
service sutro restart
###############################################################################
# geoip service
###############################################################################
if [ ! -f /etc/gunicorn.d/geoip.conf ]; then
    cat > /etc/gunicorn.d/geoip.conf <<GEOIP
CONFIG = {
    "mode": "wsgi",
    "working_dir": "$REDDIT_SRC/reddit/scripts",
    "user": "$REDDIT_USER",
    "group": "$REDDIT_USER",
    "args": (
        "--bind=127.0.0.1:5000",
        "--workers=1",
        "--limit-request-line=8190",
        "geoip_service:application",
    ),
}
GEOIP
fi
service gunicorn start
###############################################################################
# Job Environment
###############################################################################
CONSUMER_CONFIG_ROOT=$REDDIT_HOME/consumer-count.d
if [ ! -f /etc/default/reddit ]; then
cat > /etc/default/reddit <<DEFAULT
export REDDIT_ROOT=$REDDIT_SRC/reddit/r2
export REDDIT_INI=$REDDIT_SRC/reddit/r2/run.ini
export REDDIT_USER=$REDDIT_USER
export REDDIT_GROUP=$REDDIT_GROUP
export REDDIT_CONSUMER_CONFIG=$CONSUMER_CONFIG_ROOT
alias wrap-job=$REDDIT_SRC/reddit/scripts/wrap-job
alias manage-consumers=$REDDIT_SRC/reddit/scripts/manage-consumers
DEFAULT
fi
###############################################################################
# Queue Processors
###############################################################################
mkdir -p $CONSUMER_CONFIG_ROOT
function set_consumer_count {
    if [ ! -f $CONSUMER_CONFIG_ROOT/$1 ]; then
        echo $2 > $CONSUMER_CONFIG_ROOT/$1
    fi
}
set_consumer_count log_q 0
set_consumer_count search_q 0
set_consumer_count del_account_q 1
set_consumer_count scraper_q 1
set_consumer_count markread_q 1
set_consumer_count commentstree_q 1
set_consumer_count newcomments_q 1
set_consumer_count vote_link_q 1
set_consumer_count vote_comment_q 1
set_consumer_count automoderator_q 0
set_consumer_count butler_q 1
chown -R $REDDIT_USER:$REDDIT_GROUP $CONSUMER_CONFIG_ROOT/
###############################################################################
# Start everything up
###############################################################################
# the initial database setup should be done by one process rather than a bunch
# vying with each other to get there first
reddit-run -c 'print "ok done"'
# ok, now start everything else up
initctl emit reddit-stop
initctl emit reddit-start
###############################################################################
# Cron Jobs
###############################################################################
if [ ! -f /etc/cron.d/reddit ]; then
cat > /etc/cron.d/reddit <<CRON
0 3 * * * root /sbin/start --quiet reddit-job-update_sr_names
30 16 * * * root /sbin/start --quiet reddit-job-update_reddits
0 * * * * root /sbin/start --quiet reddit-job-update_promos
*/5 * * * * root /sbin/start --quiet reddit-job-clean_up_hardcache
*/2 * * * * root /sbin/start --quiet reddit-job-broken_things
*/2 * * * * root /sbin/start --quiet reddit-job-rising
0 * * * * root /sbin/start --quiet reddit-job-trylater
# liveupdate
* * * * * root /sbin/start --quiet reddit-job-liveupdate_activity
# jobs that recalculate time-limited listings (e.g. top this year)
PGPASSWORD=password
*/15 * * * * $REDDIT_USER $REDDIT_SRC/reddit/scripts/compute_time_listings link year "['hour', 'day', 'week', 'month', 'year']"
*/15 * * * * $REDDIT_USER $REDDIT_SRC/reddit/scripts/compute_time_listings comment year "['hour', 'day', 'week', 'month', 'year']"
# disabled by default, uncomment if you need these jobs
#* * * * * root /sbin/start --quiet reddit-job-email
#0 0 * * * root /sbin/start --quiet reddit-job-update_gold_users
CRON
fi