reddit/scripts
David King f63050fb14 Make the mr_top reducer run sort(1) with LC_ALL=C
By default, sort(1) on Linux collates according to the current locale. Normally
this is what humans want, but for mapreduce it breaks the guarantee that a
reducer sees every instance of the same key consecutively. Here's an example:

    user/comment/top/week/1102922   26  1453516098.92   t1_cz8jgq9
    user/comment/top/week/1102922   3   1453527927.97   t1_cz8ovzj
    user/comment/top/week/11029224  1   1453662674.45   t1_cza98tb
    user/comment/top/week/1102922   4   1453515976.97   t1_cz8jee8
    user/comment/top/week/1102922   4   1453519790.67   t1_cz8lavb
    user/comment/top/week/11029224  2   1453827188.31   t1_czcotf1
    user/comment/top/week/1102922   7   1453521946.74   t1_cz8mb50
    user/comment/top/week/1102922   7   1453524230.93   t1_cz8ncj2
    user/comment/top/week/1102922   7   1453527760.32   t1_cz8otkx
    user/comment/top/week/1102922   7   1453528700.96   t1_cz8p6u3
    user/comment/top/week/11029228  1   1453285875.44   t1_cz525gu
    user/comment/top/week/11029228  1   1453292202.65   t1_cz53ulm
    user/comment/top/week/11029228  1   1453292232.55   t1_cz53uxe

According to sort(1) using the default locale, this is already sorted.
Unfortunately, that means that to a reducer this list represents 6 different
listings (each of which will overwrite the previous runs of the same listing).
But that's not what we want. It's actually three listings, like:

    user/comment/top/week/1102922   26  1453516098.92   t1_cz8jgq9
    user/comment/top/week/1102922   3   1453527927.97   t1_cz8ovzj
    user/comment/top/week/1102922   4   1453515976.97   t1_cz8jee8
    user/comment/top/week/1102922   4   1453519790.67   t1_cz8lavb
    user/comment/top/week/1102922   7   1453521946.74   t1_cz8mb50
    user/comment/top/week/1102922   7   1453524230.93   t1_cz8ncj2
    user/comment/top/week/1102922   7   1453527760.32   t1_cz8otkx
    user/comment/top/week/1102922   7   1453528700.96   t1_cz8p6u3
    user/comment/top/week/11029224  1   1453662674.45   t1_cza98tb
    user/comment/top/week/11029224  2   1453827188.31   t1_czcotf1
    user/comment/top/week/11029228  1   1453285875.44   t1_cz525gu
    user/comment/top/week/11029228  1   1453292202.65   t1_cz53ulm
    user/comment/top/week/11029228  1   1453292232.55   t1_cz53uxe
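
As a rough check of the counts above, the sketch below counts how many groups a
streaming reducer would see in each ordering, assuming it starts a new group
whenever the key changes. The name count_runs is hypothetical and is not part
of mr_top; the keys are the first column of the example, in the order the
locale-aware sort emitted them.

```python
from itertools import groupby

PREFIX = "user/comment/top/week/"

# Keys (first column) in the order the locale-aware sort produced them.
locale_order = [PREFIX + k for k in (
    "1102922", "1102922", "11029224", "1102922", "1102922", "11029224",
    "1102922", "1102922", "1102922", "1102922",
    "11029228", "11029228", "11029228",
)]


def count_runs(keys):
    """Count runs of equal adjacent keys, i.e. the groups a reducer sees."""
    return sum(1 for _ in groupby(keys))


# Python compares ASCII strings by code point, which matches byte order.
locale_runs = count_runs(locale_order)
byte_runs = count_runs(sorted(locale_order))
print(locale_runs, byte_runs)
```

With the locale ordering the keys form 6 runs; once sorted by raw bytes they
form one run per distinct key.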

To do this, we need to set the environment variable LC_ALL=C when running
sort(1) to indicate that the sorting should operate only on the raw bytes.
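
A minimal sketch of the idea, not the actual mr_top change: it runs sort(1) in
a child process whose environment has LC_ALL=C, leaving the parent's locale
untouched. The function name sort_c_locale is hypothetical, and the sketch
assumes a sort binary on PATH and input lines that contain no newlines.

```python
import os
import subprocess


def sort_c_locale(lines):
    """Sort byte strings with sort(1), comparing raw bytes (LC_ALL=C)."""
    # Copy the current environment and override only LC_ALL, which takes
    # precedence over LC_COLLATE and LANG.
    env = dict(os.environ, LC_ALL="C")
    data = b"".join(line + b"\n" for line in lines)
    proc = subprocess.run(
        ["sort"],
        input=data,
        stdout=subprocess.PIPE,
        env=env,
        check=True,  # raise CalledProcessError if sort(1) fails
    )
    return proc.stdout.splitlines()
```

With this environment, every line keyed by user/comment/top/week/1102922 sorts
ahead of the lines keyed by user/comment/top/week/11029224, because the field
separator (assumed here to be a tab) compares lower than any digit.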

It looks like this has been broken since the Trusty Tahr upgrade.
2016-02-18 15:35:02 -08:00