This is the final step in the saga of the pg vote rel destruction. We've
been dual-writing to PG and C* while gaining confidence that the pure-C*
model can survive full load. This kills the pgvote databases and moves
us forward into a pure-Cassandra world for votes, which should save us
considerable operational headaches. After rolling this out, we cannot
switch back without considerable effort.
When he reached the New World, Cortez burned his relational databases.
As a result his queue processors were well motivated.
The logic of this code contained a couple of subtle errors that could
cause strange behavior. In reddit's current state of having two
"automatic subreddits" (which are always included in the front page set
and not counted towards the limit), the fact that the automatic_ids list
could have an item removed while being iterated over meant that
unsubscribing from the first automatic subreddit (/r/blog) made it
effectively impossible to unsubscribe from the second one
(/r/announcements). If you unsubscribed, it would still be present in
your front page regardless, and if you stayed subscribed it would
actually be present twice.
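For illustration, a minimal sketch of this class of bug (the names and
values here are made up, not the actual subscription code): removing an
element from a Python list while iterating over it shifts the remaining
elements left, so the iterator silently skips the next one.

```
# Hypothetical reproduction of the bug; not the real reddit code.
automatic_ids = ["t5_blog", "t5_announcements"]
unsubscribed = {"t5_blog", "t5_announcements"}  # user opted out of both

for sr_id in automatic_ids:
    if sr_id in unsubscribed:
        # BUG: mutating the list mid-iteration shifts the remaining
        # items left, so "t5_announcements" is never visited.
        automatic_ids.remove(sr_id)

print(automatic_ids)  # ['t5_announcements'] -- should be []
```

The fix is to iterate over a copy (e.g. `list(automatic_ids)`) or to
build a new, filtered list instead of mutating in place.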
The goal of a login ratelimit system is to prevent brute force attacks
on passwords.
The current login ratelimit system is based on VDelay, which uses
exponential backoff keyed on IP address after failed login attempts.
This is not ideal: corporate proxies and large-scale NAT (LSN) make the
false-positive rate very high, resulting in legitimate users getting
the dreaded "you've been doing that too much".
This new system uses a factored-out version of the core ratelimiting
system, which uses fixed ratelimits per period (allowing some
burstiness) and is per-account. To help mitigate the effects of a
denial of service attack on a specific user, different ratelimit
buckets are used depending on whether the user has previously used the
IP the login request is coming from.
As an escape hatch, successfully resetting an account's password adds
the current IP to that account's recent IPs, allowing it into the safer
ratelimit bucket.
The ratelimit never applies if you are currently logged in as the user,
allowing account deletion to happen regardless of ongoing brute force /
denial of service attacks.
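A rough sketch of the bucket selection described above (the function
name, key names, and limits are all hypothetical, not the actual
implementation):

```
# Hypothetical sketch; the real limits and key names may differ.
KNOWN_IP_LIMIT = (50, 600)    # e.g. 50 attempts per 10 minutes
UNKNOWN_IP_LIMIT = (10, 600)  # stricter bucket for unfamiliar IPs

def login_ratelimit(account, ip, logged_in_as_account):
    """Pick the ratelimit (key, limit) bucket for a login attempt."""
    if logged_in_as_account:
        # Never ratelimit a user acting on their own account, so
        # account deletion always works.
        return None
    if ip in account.recent_ips:
        # Familiar IP (including IPs added by a password reset):
        # the safer, more generous bucket.
        return ("rl-login-known", account._id), KNOWN_IP_LIMIT
    # Unfamiliar IP: brute force / DoS traffic lands here without
    # locking the real user (on a known IP) out.
    return ("rl-login-unknown", account._id), UNKNOWN_IP_LIMIT
```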
Since we have an HTTPS-capable CDN in front of our S3 static domains
now, it's far faster for clients to use the CDN on HTTPS as well rather
than going straight to (high-latency) S3.
This patch makes it so that we continue to store URLs with explicit
HTTP schemes, but instead of conditionally converting to HTTPS, we
render protocol-relative URLs. This should be safe for systems using
the filesystem media provider, as we've had an SSL cert there all
along.
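For example, rendering a stored `http://` URL protocol-relatively is
just a matter of dropping the scheme (the helper name here is
illustrative):

```
def protocol_relative(url):
    """Strip an explicit scheme so the browser reuses the page's own."""
    if "://" in url:
        return "//" + url.split("://", 1)[1]
    return url

protocol_relative("http://thumbs.example.com/t3_abc.png")
# -> "//thumbs.example.com/t3_abc.png"
```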
Since the introduction of the media providers and the default
installation of the filesystem media provider, it's no longer necessary
for local / non-AWS installs to use dynamically served stylesheets.
This patch removes that option to reduce complexity in the stylesheet
flows.
reddit uses Google Analytics[0] as a tool to track events on the reddit.com
website, which allows for gathering page load and user event data while
keeping users anonymized. However, with the high volume[1] of traffic
that reddit receives, the data collection limit[2] (even with a premium
account) is often exceeded by a wide margin.
Wikipedia states[3] "... sampling is concerned with the selection of a
subset of individuals from within a statistical population to estimate
characteristics of the whole population." We can, using this principle,
send a small portion of user events to the Google Analytics collection
endpoints rather than sending the entire data set, and achieve a
reasonable approximation of global user behavior without exceeding the
data usage limits defined by Google Analytics.
In order to achieve this, the Google Analytics JavaScript library
provides a method to set a sampling rate[4], a percentage from 1 to 100.
By calling:
```
_gaq.push(['_setSampleRate', '80']);
```
one can set the sample rate so that 80% of visitors are tracked. In
reddit's case, I suggest a default sampling rate of 50%. Here, I have
added a `_setSampleRate` call to the `_gaq` object created within
`utils.html`. It gets its value from the config, which allows for easy
value changes and avoids a 'magic value' set in multiple places in the
code.
[0] - https://www.reddit.com/help/privacypolicy#p_22
[1] - https://www.reddit.com/r/AskReddit/about/traffic
[2] - https://support.google.com/analytics/answer/1070983?hl=en
[3] - http://en.wikipedia.org/wiki/Sampling_(statistics)
[4] - https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiBasicConfiguration#_gat.GA_Tracker_._setSampleRate
For some app pools that are selected based on the incoming request
source, such as whoalane, we may want to apply the ratelimit to ALL
kinds of requests to ensure that resources are being used fairly. This
adds a strict enforcement mode which can be enabled in the config.
OAuth will continue to be enforced per-client ID, but all other
requests will get the sitewide ratelimit.
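Sketched out, the decision might look like this (the config flag and
key names are hypothetical):

```
def ratelimit_key(request, strict_ratelimit_enabled):
    """Choose which ratelimit bucket a request counts against."""
    if request.oauth_client_id:
        # OAuth stays enforced per-client ID in either mode.
        return ("rl-oauth", request.oauth_client_id)
    if strict_ratelimit_enabled:
        # Strict mode: every non-OAuth request shares the sitewide
        # per-IP bucket.
        return ("rl-sitewide", request.ip)
    return None  # default: no sitewide limit for this request
```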
Right now we only give HSTS grants when the user is on g.domain
so we can easily revoke the grant. We also track changes to the
forced HTTPS pref across sessions and modify the user's session
cookies as needed.
A floor of 1 on total link karma already existed; this adds a floor to
comment karma as well. It also starts applying these floors everywhere,
including in the subreddit-specific karma breakdown.
In addition, admins are now exempt from the flooring and can see the
actual karma totals for users in all locations.
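The flooring itself is trivial; a sketch (the helper name is
illustrative):

```
def displayed_karma(karma, viewer_is_admin):
    """Floor karma at 1 for display, except for admin viewers."""
    if viewer_is_admin:
        return karma  # admins see the real total, negatives included
    return max(karma, 1)

displayed_karma(-420, viewer_is_admin=False)  # -> 1
displayed_karma(-420, viewer_is_admin=True)   # -> -420
```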
This consolidates the three previous globals of REPLY_AGE_LIMIT,
VOTE_AGE_LIMIT, and REPORT_AGE_LIMIT into a single timeinterval setting
of ARCHIVE_AGE, and allows individual subreddits to use an archive_age
attr to override this default cutoff.
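The lookup then becomes a single check (the helper and the default
value here are illustrative; attribute names follow the description
above):

```
from datetime import datetime, timedelta

ARCHIVE_AGE = timedelta(days=180)  # assumed default for this sketch

def is_archived(thing, subreddit):
    """True once a thing is too old to reply to, vote on, or report."""
    cutoff = getattr(subreddit, "archive_age", None) or ARCHIVE_AGE
    return datetime.utcnow() - thing.created_utc > cutoff
```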