Dan Wheeler
92f5ce5e29
doc tweak: make usage in data-scripts consistent with filenames in data/
2015-11-09 22:53:36 -08:00
Dan Wheeler
cb040cd780
newly generated build_frequency_lists.py
2015-11-09 20:57:58 -08:00
Dan Wheeler
c8443ba867
bugfix: truncate dictionary sizes after sorting
2015-11-09 20:38:53 -08:00
Dan Wheeler
fecfe4e719
increase dictionary size to, combined, ~1M tokens
2015-11-09 19:27:08 -08:00
Dan Wheeler
31bb0df296
count_xato: tweak debugging message
2015-11-09 15:15:24 -08:00
Dan Wheeler
3bf5a6cf3a
skip non-unicode top passwords in xato. (this only skips one pw currently)
2015-11-09 14:26:44 -08:00
Dan Wheeler
0e8c7ebc5d
unicode->ascii normalization and other count_* tweaks
- wikipedia and wiktionary processing now use unidecode library
to convert unicode characters to ascii substitutes, for example
accented e -> e.
- wiktionary counting now disregards words ending in "'s" for
sufficiently high rank, eliminating many duplicates.
2015-11-09 13:02:07 -08:00
Dan Wheeler
31d35d88d0
build_frequency_lists.py overhaul
2015-11-08 23:52:38 -08:00
Dan Wheeler
bc1f782a2d
pull tokenization/normalization/counting logic into a separate file for each raw dictionary source
2015-11-08 17:11:49 -08:00
Dan Wheeler
81ed2c4d55
speed up builds by joining dictionaries in frequency_lists.coffee into single strings
coffee compilation now takes about 1s instead of 20;
full `npm run build` takes about 2-3s instead of 30.
2015-08-16 23:39:37 -07:00
Dan Wheeler
e20e715093
naming: update for data-scripts
2015-08-07 14:31:13 -07:00
Dan Wheeler
c069580ae7
directory cleanup! .coffee in src, compiled/bundled/minified .js in lib
2015-08-05 11:24:02 -07:00