Commit Graph

7284 Commits

Author SHA1 Message Date
Brian Simpson
ca800d4eb2 CommentBuilder: Separate handling of MoreRecursions
This isolates the work and logic to one place rather than spread out in the
main CommentBuilder function.
2016-02-29 12:19:45 -08:00
Chris Slowe
5dabab71cb LoID fixup: only create new on non-static render types 2016-02-26 10:09:13 -08:00
Chris Slowe
ea17655abf Add server-side loid generation and cookieing.
This doesn't remove the existing loid generation in JS, but will supercede it as the first request from the server will have the an loid cookie.

This also doesn't change the current cookie behavior, but rather localizes all of the loid generation code to a module so that we can eventually fix it up and not break behavior in the rest of the app.
2016-02-26 10:08:19 -08:00
Chris Slowe
05fa36f801 Add ipv4/24 & ipv4/16 (obfuscated), and hooks to events 2016-02-26 10:07:19 -08:00
Chris Slowe
0ddc88523f Events: track rank of each link in a listing in screenview_events
Example payload:

```json
[
  {
    "event_topic":"screenview_events",
    "event_type":"cs.screenview",
    "event_ts":1455145909943,
    "uuid":"b1507273-c04e-43ab-b339-fb91f72b72b5",
    "payload": {
      "user_id":52,
      "user_name":"KeyserSosa",
      "listing_name":"frontpage",
      "rank_by_link":{"t5_20":1,"t5_1z":2,"t5_1y":3,"t5_1x":4,"t5_1w":5,"t5_1v":6,"t5_1u":7,"t5_1t":8,"t5_1s":9,"t5_1r":10,"t5_1q":11,"t5_1p":12,"t5_1o":13,"t5_1n":14,"t5_1m":15,"t5_1l":16,"t5_1k":17,"t5_1j":18,"t5_1i":19,"t5_1h":20,"t5_1g":21,"t5_1f":22,"t5_1e":23,"t5_1d":24,"t5_1c":25},
      "language":"en",
      "dnt":false,
      "referrer_domain":"reddit.local",
      "referrer_url":"https://reddit.local/",
      "target_type":"listing",
      "target_sort":"new",
      "app_name":"reddit.com",
      "utc_offset":-8,
      "user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36",
      "domain":"reddit.local",
      "base_url":"/new/"
    }
  }
]
```
2016-02-26 10:05:09 -08:00
Chris Slowe
1af69e491b Anti-Evil: move nofollow logic into a separate classmethod with a hook 2016-02-26 10:01:25 -08:00
Brian Simpson
cb72d46be7 CommentBuilder: Add a method to modify the comment tree
This is instead of bundling the modifications in with the logic
to determine the initial candidate comments.
2016-02-25 15:03:46 -08:00
Brian Simpson
f8e76c5226 CommentBuilder: don't shuffle the top level MoreChildren 2016-02-25 14:03:48 -08:00
Brian Simpson
c867b98619 CommentBuilder: Add comment about MoreChildren depth bug 2016-02-25 14:03:48 -08:00
Brian Simpson
e6ccea00a2 CommentBuilder: Add some comments in the constructor
The CommentBuilder has 3 different modes:
* regular mode for building a comments page
* permalink mode for building the tree starting from a specific comment
* morechildren mode for building the tree starting from specific comments
2016-02-25 14:03:48 -08:00
Brian Simpson
6fe0169198 Move make_wrapper to things.py 2016-02-25 14:03:48 -08:00
Brian Simpson
e272a9fe15 Delete unnecessary link_comments_and_sort 2016-02-25 14:03:48 -08:00
Brian Simpson
c94efffcc9 Delete unnecessary get_comment_tree 2016-02-25 14:03:44 -08:00
Brian Simpson
c74c75800d CommentTree: Calculate num_children
num_children is derived directly from the tree, so it makes sense
to calculate it on CommentTree retrieval rather than waiting and making
the CommentBuilder handle it. Calculating child counts in advance for the
full tree appears to be faster than the previous method of using
`get_num_children()` to calculate the child count only for visible comments.
2016-02-25 12:24:05 -08:00
Brian Simpson
ff01943ca7 CommentBuilder: Add test for limited children 2016-02-25 12:24:05 -08:00
Brian Simpson
a2004b2e45 CommentBuilder: Add test for permalink defocusing 2016-02-25 12:24:00 -08:00
Brian Simpson
cce71b0a24 js_config: Guard against unset c.user 2016-02-24 12:29:55 -08:00
Brian Simpson
9caccc3043 _handle_register: Handle AccountExists exception
This is normally caught by the VUname validator, but simultaneous
requests can both pass that check and then the second one will trigger
the exception when actually trying to create the account.
2016-02-24 12:13:18 -08:00
Ricky Ramirez
bddb833e96 Update PII retention periods to 100 days.
This was bumped from 90 to 100 days in the latest version of our privacy
policy.
2016-02-24 11:53:19 -08:00
Chad Birch
b0d393b8ff Update ChillingEffects references to Lumen
Chilling Effects has changed its name to Lumen and its url to
lumendatabase.org: https://www.lumendatabase.org/blog_entries/763
2016-02-23 18:32:13 -07:00
Brian Simpson
89cd42b710 Don't show ads in submit duplicates listing 2016-02-23 15:24:38 -08:00
Brian Simpson
d8ce869312 On submit, show up to 100 duplicates rather than all duplicates 2016-02-23 15:24:30 -08:00
Brian Simpson
98f6365892 Stop writing LinksByUrl
It's now been completely replaced by LinksByUrlAndSubreddit
2016-02-23 15:24:22 -08:00
Brian Simpson
b18c4f17ad Read from LinksByUrlAndSubreddit 2016-02-23 15:24:14 -08:00
Brian Simpson
dffbae446a Add LinksByUrlAndSubreddit backfill script 2016-02-23 15:24:05 -08:00
Brian Simpson
a497cbfb37 Dual write to LinksByUrlAndSubreddit 2016-02-23 15:23:53 -08:00
Brian Simpson
8ac7bfc075 Add LinksByUrlAndSubreddit
This will replace LinksByUrl.
2016-02-23 15:23:36 -08:00
Chad Birch
19b38a4615 Cloudflare purge: stop triple-purging
Supposedly the purging is more reliable now, so it shouldn't be
necessary to purge everything three times to be sure it works.
2016-02-22 13:51:40 -07:00
13steinj
05843ebce7 Don't load an empty box for non mods if there is no stylesheet 2016-02-22 14:24:59 -05:00
13steinj
a5d1b68882 Add Images to /about/stylesheet for non-mods
Adds images to /about/stylesheets for easy view, even when a non mod, since the images were already accessible via inspect element as well as via the JSON template, as semi-wanted on /r/csshelp numerous times.
2016-02-22 14:24:59 -05:00
Florence Yeun
e2864b8104 Send login events after successful login
Send `register_attempt` and `login_attempt` events after a successful
login so that the logged in user id is available.
2016-02-19 11:29:45 -08:00
Florence Yeun
b3f3bb80e5 Events: Add login modal event
Add `r.analytics.loginRequiredEvent()`, similar to
`r.analytics.timeoutForbiddenEvent()`.
2016-02-19 11:29:41 -08:00
Florence Yeun
9463f2615e Events: Add login events
Add events for `register_attempt`, `login_attempt`, and `password_reset`.
2016-02-19 11:29:32 -08:00
David King
3fe393ac9e Add titles and bodies to modaction API responses
As requested here https://www.reddit.com/r/changelog/comments/428vdq/upcoming_reddit_change_switching_from_rss_20_to/cz8l0oq
2016-02-18 15:36:15 -08:00
David King
6b90ecad3b Atom <link rel="alternate"> elements should point to HTML content
Right now both <link> elements point to the Atom content rather than
one Atom and one HTML

Before:

    <link rel="self" href="https://ketralnis.staging.snooguts.net/r/programming.xml" type="application/atom+xml" />
    <link rel="alternate" href="https://ketralnis.staging.snooguts.net/r/programming.xml" type="text/html" />

After:

    <link rel="self" href="https://ketralnis.staging.snooguts.net/r/programming.xml" type="application/atom+xml" />
    <link rel="alternate" href="https://ketralnis.staging.snooguts.net/r/programming" type="text/html" />
2016-02-18 15:36:11 -08:00
David King
6484a52fad Atom: <published/> -> <updated/>
Atom requires `<updated/>` and does not require `<published/>`. This is a simple
rename of this field
2016-02-18 15:36:06 -08:00
David King
e201463db9 Stop doing permacache mutations in a background thread
Using background threads causes annoying complications in general (like right
now if the worker thread is backed up but we restart the app we just lose the
work). But the particular problem at hand is that the permacache zlib patch
wants to do compression. That can cost a lot of CPU so we want to know how long
it takes which means that we need timers. But timers work by reading from
g.stats, and g isn't available in the background thread.

We could find some other way to do the timers, but I'd rather just stop using
the AMQP background thread for anything but actual AMQP stuff which will kill
both birds with one stone.

At a glance it looks like this would cost us roughly 25ms per update to these
listings. 99th percentile goes up to 480ms but seems to float nearer to 100ms.

Deletes has a crazier 99th percentile but a similar mean. Way fewer of them
happen.
2016-02-18 15:36:02 -08:00
David King
c8f10bb7b8 Parallelise parts of mr_top jobs
**Explanation**: compute_time_listings is slow. Really slow. At a quick glance, here are the jobs running right now:

    date: Sun Jan 17 20:04:56 PST 2016
    -rw-rw-r-- 1 ri ri 1.2G Jan 17 12:37 comment-week-data.dump
    -rw-rw-r-- 1 ri ri 683M Jan 17 12:25 comment-week-thing.dump
    -rw-rw-r-- 1 ri ri  53G Jan 16 07:13 comment-year-data.dump
    -rw-rw-r-- 1 ri ri  31G Jan 16 04:37 comment-year-thing.dump
    -rw-rw-r-- 1 ri ri 276M Jan 17 17:04 link-week-data.dump
    -rw-rw-r-- 1 ri ri  70M Jan 17 17:03 link-week-thing.dump

So the currently running top-comments-by-year listing has been running for nearly 37 hours and isn't done. top-comments-by-week has been running for 8 hours. top-links-by-week has been running for 3 hours. And this is just me checking on currently running jobs, not actual completion times.

The slow bit is the actual writing to Cassandra in `write_permacache`. This is mostly because `write_permacache` is extremely naive and blocks waiting for individual writes with no batching or parallelisation. There are a lot of ways to work around this and some of them will become easier when we're not longer writing out to the permacache at all, but until then (and even after that) this approach lets us keep doing the simple-to-understand thing while parallelising some of the work.

**The approach**: `compute_time_listings` is written as a mapreduce job in our `mr_tools` toolkit, with `write_permacache` as the final reducer. In `mr_tools`, you can run multiple reducers as long as a given reducer can be guaranteed to receive all of the keys for the same key. So this patch adds `hashdist.py`, a tool that runs multiple copies of a target job and distributes lines to them from stdin using their first tab-delimited field to meet this promise. (The same script could apply to mappers and sorts too but in my tests for this job the gains were minimal because `write_permacache` is still the bottleneck up to a large number of reducers.)

**Numbers**: A top-links-by-hour listing in prod right now takes 1m46.387s to run. This patch reduces that to 0m43.960s using 2 jobs (a 60% savings). That top-links-by-week job that before I killed after 3 hours completed in 56m47.329s. The top-links-by-year job that I killed last week at over 36 hours finished in 19 hours.

**Downsides**: It costs some additional RAM: roughly 10mb for hashdist.py and 100mb in memory for each additional copy of the job. It multiplies the effective load on Cassandra by the number of jobs (although I have no reason to believe that it's practical to overload Cassandra this way right now; I've tested up to 5 jobs).

**Further work**: with this we could easily do sort|reducer fusion to significantly reduce the work required by the sorter. `hashdist.py` as written is pretty slow and is only acceptable because `write_permcache` is even slower; a non-Python implementation would be straight forward and way faster.
2016-02-18 15:35:58 -08:00
David King
b5c8c91f8d Move vote queues' listing updates into the foreground
`add_queries` occurs in the AMQP worker thread by default. The results of
https://github.com/reddit/reddit-public/pull/4687 will tell us if that's a good
idea in general, but in the mean time it probably doesn't make sense to do it
for queue processes
2016-02-18 15:35:54 -08:00
David King
5d5d41ebee Quantify the time we're spending doing background permacache mutations
In the permacachezlib patch we want to be able to time how long we're spending
doing the compression and serialisation. There are times that we use the
permacache in the AMQP worker thread, which can't get to g, so it can't get to
g.stats. The solution is probably to just stop using the permacache in this way,
but first we'll want to know what it would cost us to start doing these
operations in the foreground.

Most permacache mutations occur in queue processes where it shouldn't matter,
but there are cases like deletions and inboxes where they occur in-request. So
knowing the costs of foregrounding them is important.

Note that this causes the timers to occur in the worker thread. That has known
to be safe in the past, but it's been a while since the support was used.
2016-02-18 15:35:16 -08:00
David King
77d15d84db Add taglinetext to inbox Atom feeds
I like this a little better
2016-02-18 15:35:13 -08:00
David King
d5e3a7093e Don't use StringTemplate variables in Messages in Atom 2016-02-18 15:35:09 -08:00
David King
3a7d1ed3b5 Fix a template bug in Atom timestamps on Messages 2016-02-18 15:35:06 -08:00
David King
f63050fb14 Make the mr_top reducer run sort(1) with LC_ALL=C
By default, Linux's sort(1) uses locale-based sorting. Normally this is what
humans want, but for mapreduce it breaks the guarantee that the same reducer
always sees each instance of the same key. Here's an example:

    user/comment/top/week/1102922   26  1453516098.92   t1_cz8jgq9
    user/comment/top/week/1102922   3   1453527927.97   t1_cz8ovzj
    user/comment/top/week/11029224  1   1453662674.45   t1_cza98tb
    user/comment/top/week/1102922   4   1453515976.97   t1_cz8jee8
    user/comment/top/week/1102922   4   1453519790.67   t1_cz8lavb
    user/comment/top/week/11029224  2   1453827188.31   t1_czcotf1
    user/comment/top/week/1102922   7   1453521946.74   t1_cz8mb50
    user/comment/top/week/1102922   7   1453524230.93   t1_cz8ncj2
    user/comment/top/week/1102922   7   1453527760.32   t1_cz8otkx
    user/comment/top/week/1102922   7   1453528700.96   t1_cz8p6u3
    user/comment/top/week/11029228  1   1453285875.44   t1_cz525gu
    user/comment/top/week/11029228  1   1453292202.65   t1_cz53ulm
    user/comment/top/week/11029228  1   1453292232.55   t1_cz53uxe

According to sort(1) using the default locale, this is already sorted.
Unfortunately, that means that to a reducer this list represents 6 different
listings (each of which will overwrite the previous runs of the same listing).
But that's not what we want. It's actually two listings, like:

    user/comment/top/week/1102922   26  1453516098.92   t1_cz8jgq9
    user/comment/top/week/1102922   3   1453527927.97   t1_cz8ovzj
    user/comment/top/week/1102922   4   1453515976.97   t1_cz8jee8
    user/comment/top/week/1102922   4   1453519790.67   t1_cz8lavb
    user/comment/top/week/1102922   7   1453521946.74   t1_cz8mb50
    user/comment/top/week/1102922   7   1453524230.93   t1_cz8ncj2
    user/comment/top/week/1102922   7   1453527760.32   t1_cz8otkx
    user/comment/top/week/1102922   7   1453528700.96   t1_cz8p6u3
    user/comment/top/week/11029224  1   1453662674.45   t1_cza98tb
    user/comment/top/week/11029224  2   1453827188.31   t1_czcotf1
    user/comment/top/week/11029228  1   1453285875.44   t1_cz525gu
    user/comment/top/week/11029228  1   1453292202.65   t1_cz53ulm
    user/comment/top/week/11029228  1   1453292232.55   t1_cz53uxe

To do this, we need to set the enviroment variable LC_ALL=C when running sort(1)
to indicate that the sorting should operate only on the raw bytes.

It looks like this has been broken since the Trusty Tahr upgrade.
2016-02-18 15:35:02 -08:00
David King
57cd593cc1 Add a simple profiling decorator for convenience
I've been using this and it's handy in a pinch
2016-02-18 15:34:59 -08:00
David King
8937ae2d3a Switch default Atom <content> type from XHTML to HTML
This is to resolve the problem that we can generate markdown with named entities
like `&nbsp;`. These are valid in XHTML's DTD but because entities aren't
namespaced they aren't valid in Atom documents, even within the XHTML namespace.
The fix is to switch to the RSS style double-escaped HTML type for `<content>`
blocks

Another approach would be to used numeric entities instead, but since we allow
users to type HTML entities we'd have to replace them too. That's a pretty large
change for such a small problem.

As a side effect of this approach, this means we also have to get rid of the
"[23 comments]" label on Link's feed items because the double-escaping breaks
the StringTemplate replacement
2016-02-18 15:34:55 -08:00
David King
5ec85af73b Fix absolute logo references in Atom feeds
Before:

    <icon>//www.reddit.com/static/icon.png</icon>

After:

    <icon>https://www.reddit.com/static/icon.png</icon>
2016-02-18 15:34:51 -08:00
David King
e87de77b1a Convert RSS feeds to Atom
[Atom](https://tools.ietf.org/html/rfc4287#section-4.1.2) is a format similar to RSS with [wide client support](https://en.wikipedia.org/wiki/Comparison_of_feed_aggregators#Web_feed_and_protocol_support) that fixes many of the warts in RSS. In particular, it has proper support for HTML without content sniffing, which will allow us to avoid XSS issues

![Up And Atom!](https://i.imgur.com/nbHbHg0.gif)
2016-02-18 15:34:42 -08:00
David King
8076ed13f1 Security fix: double escape more stuff in RSS feeds
We have our Mako filters set to escape HTML by default. Unfortunately RSS
requires double escaping in some places and not in others, so there isn't a
reasonable default. Here I have done a pass through the `.xml` templates to find
user data that's ending up single-escaped and added double escaping to them.

This requirement is because RSS grew HTML support organically in a way that
clients can't tell if a field actually contains HTML or not. Sometimes it's
double escaped, sometimes it's not. Clients have to take a guess by sniffing for
`&gt;` characters and hoping they get it right. This also means that it's
impossible for servers to reliably tell the clients which data this field
contains.

This is a bit of a ticking time bomb. Users may find ways to sneak in HTML in
the date field, or we may add new templates that forget to do double escaping on
little-used fields. I recommend that we switch these to use Atom which always
indicates whether the fields contain HTML or not. Work has started on this
conversion in another branch.

* ref https://www.reddit.com/r/AskNetsec/comments/41larg/titleheadbody_idmsgfeedsummarybodyimg/
* ref https://reddit.atlassian.net/browse/INFRA-721
* ref https://bugzilla.mozilla.org/show_bug.cgi?id=1240603
2016-02-18 15:33:59 -08:00
David King
6e71efa726 Fix a bug in compute_time_listings that would allow simultaneous runs
The cause of the bug is that if we fail to start because someone else has already started, we still delete their files.

From job-02 right now:

    write_permacache() # comment ("day", "week+
    write_permacache() # link ("day","week")
    write_permacache() # link ("day","week")
    write_permacache() # comment ("day", "week+
    write_permacache() # link ("month","year")
    write_permacache() # comment ("month", "ye+
2016-02-18 15:33:01 -08:00