mirror of
https://github.com/reddit-archive/reddit.git
synced 2026-04-27 03:00:12 -04:00
By default, Linux's sort(1) uses locale-based sorting. Normally this is what
humans want, but for mapreduce it breaks the guarantee that a reducer sees all
instances of the same key together. Here's an example:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
According to sort(1) using the default locale, this is already sorted.
Unfortunately, that means that to a reducer this list represents 6 different
listings (each of which will overwrite the previous runs of the same listing).
But that's not what we want. It's actually three listings, like:
user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9
user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj
user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8
user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb
user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50
user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2
user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx
user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3
user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb
user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1
user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu
user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm
user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe
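The difference can be checked from the reducer's side. The following is a minimal sketch, not code from this repository: it assumes the reducer treats each consecutive run of identical keys as one listing, and it uses Python's byte-wise comparison as a stand-in for a raw-byte sort. The names LOCALE_ORDER, key_of and runs are made up for illustration.

```python
from itertools import groupby

# The sample records, in the order the default-locale sort left them.
# Fields are written with single spaces here; the real separator may differ.
LOCALE_ORDER = [
    "user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9",
    "user/comment/top/week/1102922 3 1453527927.97 t1_cz8ovzj",
    "user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb",
    "user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8",
    "user/comment/top/week/1102922 4 1453519790.67 t1_cz8lavb",
    "user/comment/top/week/11029224 2 1453827188.31 t1_czcotf1",
    "user/comment/top/week/1102922 7 1453521946.74 t1_cz8mb50",
    "user/comment/top/week/1102922 7 1453524230.93 t1_cz8ncj2",
    "user/comment/top/week/1102922 7 1453527760.32 t1_cz8otkx",
    "user/comment/top/week/1102922 7 1453528700.96 t1_cz8p6u3",
    "user/comment/top/week/11029228 1 1453285875.44 t1_cz525gu",
    "user/comment/top/week/11029228 1 1453292202.65 t1_cz53ulm",
    "user/comment/top/week/11029228 1 1453292232.55 t1_cz53uxe",
]


def key_of(line):
    """Return the first whitespace-delimited field, i.e. the listing key."""
    return line.split(None, 1)[0]


def runs(lines):
    """Return the key of each consecutive run of equal keys, in order."""
    return [key for key, _ in groupby(lines, key=key_of)]


# Runs as a streaming reducer would see them in the locale-sorted input.
locale_runs = runs(LOCALE_ORDER)

# Runs after ordering the same lines by their raw bytes.
byte_runs = runs(sorted(LOCALE_ORDER, key=lambda s: s.encode("utf-8")))

print(len(locale_runs), "runs in locale order")
print(len(byte_runs), "runs in byte order")
```

In locale order the keys come out as 6 runs; in byte order each of the three keys forms exactly one run.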
To do this, we need to set the environment variable LC_ALL=C when running sort(1)
to indicate that the sorting should operate only on the raw bytes.
It looks like this has been broken since the Trusty Tahr upgrade.
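As an illustration of forcing the C locale for a single sort(1) invocation, here is a minimal sketch in Python. The helper name sort_bytewise is hypothetical and is not how the job scripts in this repository invoke sort; it only shows LC_ALL=C being set in the child process's environment while the rest of the environment is left alone.

```python
import os
import subprocess


def sort_bytewise(lines):
    """Sort lines with sort(1) under LC_ALL=C, i.e. by raw byte values."""
    # Copy the current environment and override only LC_ALL for the child.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["sort"],
        input="".join(line + "\n" for line in lines),
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.splitlines()


sample = [
    "user/comment/top/week/11029224 1 1453662674.45 t1_cza98tb",
    "user/comment/top/week/1102922 4 1453515976.97 t1_cz8jee8",
    "user/comment/top/week/1102922 26 1453516098.92 t1_cz8jgq9",
]

for line in sort_bytewise(sample):
    print(line)
```

With LC_ALL=C the field separator takes part in the comparison, so both 1102922 records sort ahead of the 11029224 record instead of interleaving with it.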