By default, sort(1) on Linux orders lines using the current locale's collation
rules. Normally this is what humans want, but for mapreduce it breaks the
guarantee that all instances of the same key reach the reducer as one
contiguous run. Here's an example:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
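According to sort(1) using the default locale, this is already sorted, most
likely because the locale's collation rules give spaces little or no weight,
so a key that is a prefix of another key interleaves with it. Unfortunately,
that means that to a reducer this list represents 6 different listings (each
of which will overwrite the previous run of the same listing). But that's not
what we want. It's actually three listings, like: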
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
To do this, we need to set the environment variable LC_ALL=C when running
sort(1) to indicate that the sorting should operate only on the raw bytes.
It looks like this has been broken since the Trusty Tahr upgrade.
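The difference between the two orderings can be checked with a short sketch. It is illustrative only and not part of the change itself: the helper names (`key_of`, `count_runs`, `sort_bytewise`, `run_sort_in_c_locale`) are made up for this example, and it assumes the key is everything before the first space. It counts how many contiguous runs of keys a reducer would see in the locale-sorted sample, then again after a byte-wise sort.

```python
import itertools
import os
import subprocess

# The sample from above, in the order the default locale considers sorted.
LOCALE_SORTED = [
    "user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9",
    "user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj",
    "user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb",
    "user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8",
    "user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb",
    "user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1",
    "user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50",
    "user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2",
    "user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx",
    "user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3",
    "user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu",
    "user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm",
    "user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe",
]


def key_of(line):
    # Assumption for this sketch: the key is the text before the first space.
    return line.split(" ", 1)[0]


def count_runs(lines):
    # Each contiguous run of one key looks like a separate listing to a reducer.
    return sum(1 for _ in itertools.groupby(lines, key=key_of))


def sort_bytewise(lines):
    # Compare raw bytes, which is how sort(1) orders lines under LC_ALL=C.
    encoded = sorted(line.encode("utf-8") for line in lines)
    return [raw.decode("utf-8") for raw in encoded]


def run_sort_in_c_locale(lines):
    # Run the real sort(1) with LC_ALL=C added to the inherited environment.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["sort"],
        input="\n".join(lines) + "\n",
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


print(count_runs(LOCALE_SORTED))                 # 6 runs in locale order
print(count_runs(sort_bytewise(LOCALE_SORTED)))  # 3 runs in byte order
```

`run_sort_in_c_locale` is defined but not called here, since it needs sort(1) on the PATH; where it is available, it should return the same order as `sort_bytewise`.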
Greetings!
This is the primary codebase that powers reddit.com.
For notices about major changes and general discussion of reddit development, subscribe to the /r/redditdev and /r/changelog subreddits.
You can also chat with us via IRC in #reddit-dev on freenode.
Quickstart
To set up your own instance of reddit to develop with, we have a handy install script for Ubuntu that will automatically install and configure most of the stack.
Alternatively, refer to our Install Guide for instructions on setting up reddit from scratch. Many frequently asked questions regarding local reddit installs are covered in our FAQ.
APIs
To learn more about reddit's API, check out our automated API documentation and the API wiki page. Please use a unique User-Agent string and take care to abide by our API rules.
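As a rough illustration of the User-Agent advice above, here is a minimal sketch that attaches a descriptive User-Agent to a request object using only the Python standard library. The User-Agent value, the `build_request` helper, and the example URL are placeholders invented for this sketch; check the API rules for what is actually expected. Nothing is sent over the network.

```python
import urllib.request

# Hypothetical User-Agent: pick something that identifies your own client.
USER_AGENT = "linux:com.example.listing-checker:v0.1 (by /u/example_user)"


def build_request(url):
    # Build (but do not send) a request that carries the custom User-Agent.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


# Example URL chosen for illustration only.
request = build_request("https://www.reddit.com/r/redditdev/about.json")

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```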
Happy hacking!
Issues and Contribution Guidelines
Thanks for wanting to help make reddit better! First things first, though: GitHub Issues is only for confirmed, active bugs. Please submit ideas to /r/ideasfortheadmins.
Please read more on contributions in CONTRIBUTING.md.