Omitting `keep_blank_values` was dropping blank query parameters.
Furthermore, converting the output of `parse_qsl` to a dictionary
was unnecessarily reordering the parameters, since dicts are not
ordered. Fortunately `urllib.urlencode` also accepts a
sequence of two-element tuples and the order of parameters in
the encoded string will match the order of parameter tuples in the
sequence.
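A minimal sketch of the pattern, using the Python 2 stdlib functions named above (the query string and parameter names are just illustrative):

```python
import urllib
import urlparse

query = "feature=&after=t3_abc123&limit=25"  # illustrative query string

# keep_blank_values=True preserves "feature=", and the result is an
# ordered list of (name, value) tuples rather than a dict
params = urlparse.parse_qsl(query, keep_blank_values=True)

# urlencode accepts the tuple sequence directly, so the original
# parameter order survives the round trip
assert urllib.urlencode(params) == query
```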
Endpoint for retrieving subreddits to surface to new users
during the onboarding process, without actually subscribing
them.
The list of featured subreddits will be managed
separately from the default ones.
For testing purposes, we add a subreddit to show
up via the inject_test_data script.
Complete an iteration of our transition to an auction model. Non-sponsor
users can only create auction campaigns, while sponsors can create either
auction or legacy fixed CPM campaigns.
* The constructor for `SSTableSimpleUnsortedWriter` changed
* `sstableloader` changed its directory structure. Now we just don't manage it at all in `tuples_to_sstables` and make the caller do the work
**Explanation**: compute_time_listings is slow. Really slow. At a quick glance, here are the jobs running right now:
date: Sun Jan 17 20:04:56 PST 2016
-rw-rw-r-- 1 ri ri 1.2G Jan 17 12:37 comment-week-data.dump
-rw-rw-r-- 1 ri ri 683M Jan 17 12:25 comment-week-thing.dump
-rw-rw-r-- 1 ri ri 53G Jan 16 07:13 comment-year-data.dump
-rw-rw-r-- 1 ri ri 31G Jan 16 04:37 comment-year-thing.dump
-rw-rw-r-- 1 ri ri 276M Jan 17 17:04 link-week-data.dump
-rw-rw-r-- 1 ri ri 70M Jan 17 17:03 link-week-thing.dump
So the currently running top-comments-by-year listing has been running for nearly 37 hours and isn't done. top-comments-by-week has been running for 8 hours. top-links-by-week has been running for 3 hours. And this is just me checking on currently running jobs, not actual completion times.
The slow bit is the actual writing to Cassandra in `write_permacache`. This is mostly because `write_permacache` is extremely naive and blocks waiting for individual writes with no batching or parallelisation. There are a lot of ways to work around this and some of them will become easier when we're no longer writing out to the permacache at all, but until then (and even after that) this approach lets us keep doing the simple-to-understand thing while parallelising some of the work.
**The approach**: `compute_time_listings` is written as a mapreduce job in our `mr_tools` toolkit, with `write_permacache` as the final reducer. In `mr_tools`, you can run multiple reducers as long as a given reducer is guaranteed to receive all of the records for a given key. So this patch adds `hashdist.py`, a tool that runs multiple copies of a target job and distributes lines to them from stdin using their first tab-delimited field to meet this promise. (The same script could apply to mappers and sorts too, but in my tests for this job the gains were minimal because `write_permacache` is still the bottleneck up to a large number of reducers.)
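To make the routing concrete, here is a minimal sketch of the idea (Python 2, like the rest of the codebase; an illustration, not the actual `hashdist.py`):

```python
import subprocess
import sys
import zlib


def hashdist(command, num_jobs):
    # start num_jobs copies of the target job, each with its own stdin pipe
    procs = [subprocess.Popen(command, shell=True, stdin=subprocess.PIPE)
             for _ in range(num_jobs)]

    for line in sys.stdin:
        key = line.split("\t", 1)[0]
        # a stable hash of the first tab-delimited field picks the copy, so
        # every record for a given key is guaranteed to reach the same reducer
        procs[zlib.crc32(key) % num_jobs].stdin.write(line)

    for proc in procs:
        proc.stdin.close()
        proc.wait()


if __name__ == "__main__":
    hashdist(sys.argv[1], int(sys.argv[2]))
```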
**Numbers**: A top-links-by-hour listing in prod right now takes 1m46.387s to run. This patch reduces that to 0m43.960s using 2 jobs (roughly a 60% saving). The top-links-by-week job that I previously killed after 3 hours completed in 56m47.329s. The top-links-by-year job that I killed last week at over 36 hours finished in 19 hours.
**Downsides**: It costs some additional RAM: roughly 10MB for hashdist.py and 100MB for each additional copy of the job. It also multiplies the effective load on Cassandra by the number of jobs (although I have no reason to believe that it's practical to overload Cassandra this way right now; I've tested up to 5 jobs).
**Further work**: With this we could easily do sort|reducer fusion to significantly reduce the work required by the sorter. `hashdist.py` as written is pretty slow and is only acceptable because `write_permacache` is even slower; a non-Python implementation would be straightforward and much faster.
By default, Linux's sort(1) uses locale-based sorting. Normally this is what
humans want, but for mapreduce it breaks the guarantee that the same reducer
always sees each instance of the same key. Here's an example:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
According to sort(1) using the default locale, this is already sorted.
Unfortunately, that means that to a reducer this list represents 6 different
listings (each of which will overwrite the previous runs of the same listing).
But that's not what we want. It's actually three listings, like:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
To do this, we need to set the environment variable LC_ALL=C when running sort(1)
to indicate that the sorting should operate only on the raw bytes.
It looks like this has been broken since the Trusty Tahr upgrade.
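If the sort were invoked from a Python driver rather than a shell script (an assumption for illustration; the flags and filenames below are only examples), the fix amounts to:

```python
import os
import subprocess

# force byte-wise collation regardless of the system locale
env = dict(os.environ, LC_ALL="C")
subprocess.check_call(["sort", "-t", "\t", "-k", "1,1", "input.tsv",
                       "-o", "sorted.tsv"], env=env)
```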
The cause of the bug is that if we fail to start because someone else has already started, we still delete their files.
From job-02 right now:
write_permacache() # comment ("day", "week+
write_permacache() # link ("day","week")
write_permacache() # link ("day","week")
write_permacache() # comment ("day", "week+
write_permacache() # link ("month","year")
write_permacache() # comment ("month", "ye+
* Make them lockless, because mr_top is the only thing that ever writes
to them. This avoids a lot of memcached round trips.
* Don't set them into permacache_memcaches; delete from it instead.
This keeps every run of mr_top from blowing out that whole cache with
listings that will never be read.
* Fix a performance bug in _mr_tools.mr_reduce_max_per_key
This is a rewrite of much of the code related to voting. Some of the
improvements include:
- Detangling the whole process of creating and processing votes
- Creating an actual Vote class to use instead of dealing with
inconsistent ad-hoc voting data everywhere
- More consistency with naming and similar details, such as vote
directions (previously True/False/None in some places, 1/-1/0 in
others; see the sketch after this list)
- More flexible methods in determining and applying the effects of votes
- Improvement of modhash generation/validation
- Removing various obsolete/unnecessary code
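As a hypothetical sketch of the direction cleanup (not the actual Vote class API), the normalization boils down to something like:

```python
def normalize_direction(direction):
    """Map legacy True/False/None vote values onto 1/-1/0."""
    if direction is True:
        return 1
    if direction is False:
        return -1
    if direction is None:
        return 0
    return int(direction)  # already one of 1/-1/0
```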
The write_live_config script often gives really strange results when
trying to display diffs of changes to dict values, since the ordering of
a dict is not defined. So key/value pairs will sometimes be rearranged
between the old and new versions, creating a confusing diff.
This changes the script to use the pprint module to generate string
representations of dicts; pprint sorts a dict by key before outputting
it, so we should get consistent representations that can be compared
more easily.
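For example (the config keys here are made up), `pprint.pformat` always emits the keys in sorted order, so equal configs line up cleanly in a diff:

```python
import pprint

live_config = {"feature_b": True, "feature_a": 10, "feature_c": "x"}

# pformat sorts the dict by key, so the string is stable across runs
print pprint.pformat(live_config)
# {'feature_a': 10, 'feature_b': True, 'feature_c': 'x'}
```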
http://pylons-webframework.readthedocs.org/en/latest/upgrading.html
This requires several code changes:
* pylons `config` option must be explicitly passed during setup
* the pylons global has been renamed from `g` to `app_globals`
* the pylons global has been renamed from `c` to `tmpl_context` (see the import sketch after this list)
* set pylons.strict_tmpl_context = False (instead of pylons.strict_c)
* redirect_to() has been swapped for redirect()
* must implement `ErrorDocuments` middleware ourselves
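The two renames can be aliased back at import time so existing code keeps using `g` and `c` (the usual Pylons 1.0 pattern):

```python
from pylons import app_globals as g
from pylons import tmpl_context as c
```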
pylons 1.0 also required an upgrade of routes from 1.11 to 1.12. This
required the following changes:
* set Mapper.minimization = True (the default value changed)
* set Mapper.explicit = False (the default value changed)
Turns out that SequenceMatcher is not quite as neat as it appears. When
diffing a string change like:
old: "this is the old string value"
new: "on"
It was displayed as:
- "this is the "
- "ld stri"
- "g value"
Since the "o" and "n" were kept. This just displays it as a wholesale
key removal/addition unless the strings are at least 50% similar.
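The 50% check is just `SequenceMatcher.ratio()` against a threshold; for the example above the ratio is well under 0.5, so the value is shown as a plain removal/addition (a sketch of the check, not the script's exact code):

```python
from difflib import SequenceMatcher

old = "this is the old string value"
new = "on"

ratio = SequenceMatcher(None, old, new).ratio()  # roughly 0.13 for this pair
if ratio < 0.5:
    # not similar enough: report a wholesale removal/addition
    print "- %r" % old
    print "+ %r" % new
```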
The script would previously dump out the entire new parsed live
configuration, which (now that we have a lot of fields in it) made it
difficult to find the ones that had actually changed. This fetches the
existing live config, compares it to the new one, and outputs only the
data that has changed for confirmation.
There's no index on SRMember.c.rel_id so instead sort the query
by SRMember.c.thing2_id (the user's id). Also the timestamp for
writing to C* is updated to an integer timestamp corresponding to
when the dual write was deployed.
We've changed the url structure of image previews a number of times, which
breaks everything uploaded prior to the latest version. This script should
find all preview images that have been uploaded thus far, move them to the
appropriate place, and save an updated and correct storage url in every Link
that uses them.
See http://docs.python-requests.org/en/latest/api/#migrating-to-1-x
for the rationale.
`.json()` also differs from `.json` in that it `raise`s instead of
returning `None` on a decoding error, but that shouldn't affect us
anywhere.
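Where the old behaviour is still wanted, the difference is easy to paper over (the URL below is just a placeholder):

```python
import requests

response = requests.get("https://www.reddit.com/api/info.json")
try:
    data = response.json()   # raises ValueError if the body isn't valid JSON
except ValueError:
    data = None              # the old `.json` attribute returned None instead
```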
Conflicts:
r2/r2/lib/media.py
If you go to a userpage and sort by top (in either the overview or comments
tabs), and restrict the time range to anything other than "all time", no
comments will be shown.
The data in these listings is built from functions in `lib/db/queries.py`
(specifically from `get_comments()` down). This ends up trying to pull the
query results from permacache (in `CachedResults.fetch_multi()`), defaulting to
an empty list if no cache entry is found.
Now, the cache entry is supposed to be populated periodically by a cronjob that
calls `scripts/compute_time_listings`. This script (and its Python helpers in
`lib/mr_top.py` and `lib/mr_tools/`) generates a dump of data from Postgresql,
then reads through that and builds up entries to insert into the cache. As
with many scripts of this sort, it expects some of the incoming data to be bad, and so
performs some basic sanity checks.
The problem is that the sanity checks have been throwing out all comments.
With no new comments, there's nothing new to put into the cache!
The root of this was a refactoring in reddit/reddit@3511b08 that combined
several different scripts that were doing similar things. Unfortunately, we
ended up requiring the `url` field on comments, which doesn't exist because,
well, comments aren't links.
Now we have two sets of fields that we expect to get, one for comments and one
for links, and all is good.
We also now have a one-line summary of processed/skipped entries printed out,
which will help to make a problem like this more obvious in the future.
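A rough sketch of the shape of the fix (field names here are assumptions, not the actual `mr_top` code):

```python
# separate required-field sets: comments have no `url`
LINK_FIELDS = ("url", "sr_id", "author_id")
COMMENT_FIELDS = ("link_id", "sr_id", "author_id")


def filter_valid(rows, required_fields):
    valid, skipped = [], 0
    for row in rows:
        if all(field in row for field in required_fields):
            valid.append(row)
        else:
            skipped += 1
    # the one-line summary makes a sudden flood of skips obvious
    print "processed %d rows, skipped %d" % (len(valid), skipped)
    return valid
```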
This new script attempts to generate some subreddits that are more like
production data. It first pulls down data from reddit.com, then uses
Markov chains to generate new data for insertion into the databases.
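For reference, the word-level Markov chain idea is roughly this (a generic sketch, not the actual script):

```python
import random
from collections import defaultdict


def build_chain(words):
    # map each word to the words seen to follow it in the source text
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, length=20):
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        out.append(word)
    return " ".join(out)
```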
* configuration now comes from the command line so it's easier to use
for multiple projects.
* the bucket is now an s3 url allowing a path prefix to be added to
files.
* authentication now comes from boto's credential system, which allows
us to use IAM roles (see the sketch below).
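A sketch of what the bucket and auth bullets amount to (the s3 url is illustrative, and boto here means boto 2):

```python
import boto
from urlparse import urlparse  # Python 2 stdlib

s3_url = "s3://my-config-bucket/myproject"   # illustrative bucket + prefix
parsed = urlparse(s3_url)
bucket_name, key_prefix = parsed.netloc, parsed.path.lstrip("/")

# boto's default credential chain covers env vars, config files, and
# instance IAM roles, so no keys need to be passed explicitly
conn = boto.connect_s3()
bucket = conn.get_bucket(bucket_name)
```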
This takes our current config payload from 4700 bytes to 1700. The goal
is to reduce zookeeper network load during config changes as well as app
restarts during deploys.
This adds two redirects, `event_click` and `event_redirect`: `event_click`
allows appending a user ID to an event before the redirect, if we require
one, and `event_redirect` services a local evented redirect, similar to ad
clicks.
`event_click` is necessary for tracking clicks from users on embeds, which are
served via redditmedia, and therefore are always anonymous. When a user clicks
through, we want to know who they were and redirect them on their way. Because
of the way we're using nginx to store events as an access log right now, this
means we'll need to use two redirects: one to append the session ID and
another to store the event with the proper session ID.