12 Commits

Author SHA1 Message Date
Dan Wheeler
92f5ce5e29 doc tweak: make usage in data-scripts consistent with filenames in data/ 2015-11-09 22:53:36 -08:00
Dan Wheeler
cb040cd780 newly generated build_frequency_lists.py 2015-11-09 20:57:58 -08:00
Dan Wheeler
c8443ba867 bugfix: truncate dictionary sizes after sorting 2015-11-09 20:38:53 -08:00
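
The ordering this bugfix refers to can be illustrated with a minimal sketch. It is not taken from build_frequency_lists.py; the function name `top_tokens` and the sample counts are hypothetical. The point is only that a frequency table must be ranked before it is cut to a size limit, otherwise the cut keeps whichever entries happen to come first rather than the most frequent ones.

```python
from collections import Counter

def top_tokens(counts, limit):
    """Return the `limit` most frequent tokens, most frequent first."""
    # Rank the whole table first (ties broken alphabetically for determinism)...
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # ...and only then truncate to the requested size.
    return [token for token, _ in ranked[:limit]]

# Hypothetical counts, listed in an order that is NOT frequency order.
counts = Counter({"the": 50, "zebra": 1, "of": 30, "and": 20})
print(top_tokens(counts, 2))  # ['the', 'of']
```

Truncating before sorting would have returned `the` and `zebra` here, which is the kind of error the commit message describes.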
Dan Wheeler
fecfe4e719 increase dictionary size to, combined, ~1M tokens 2015-11-09 19:27:08 -08:00
Dan Wheeler
31bb0df296 count_xato: tweak debugging message 2015-11-09 15:15:24 -08:00
Dan Wheeler
3bf5a6cf3a skip non-unicode top passwords in xato. (this only skips one pw currently) 2015-11-09 14:26:44 -08:00
Dan Wheeler
0e8c7ebc5d unicode->ascii normalization and other count_* tweaks
- wikipedia and wiktionary processing now use unidecode library
  to convert unicode characters to ascii substitutes, for example
  accented e -> e.

- wiktionary counting now disregards words ending in "'s" for
  sufficiently high rank, eliminating many duplicates.
2015-11-09 13:02:07 -08:00
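
A rough sketch of the two steps described in this commit message follows. The commit uses the unidecode library; to stay dependency-free, this sketch swaps in the standard library's `unicodedata` decomposition, which covers fewer characters than unidecode. The function names, the `possessive_cutoff` value, and the reading of "sufficiently high rank" as "rank number above a cutoff" are all assumptions, since the message does not specify them.

```python
import unicodedata

def to_ascii(token):
    """Approximate ASCII folding: decompose, then drop non-ASCII code points.

    Stand-in for unidecode (assumption): handles accented letters only.
    """
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def keep_token(token, rank, possessive_cutoff=1000):
    """Drop tokens ending in "'s" once their rank exceeds a hypothetical cutoff."""
    return not (token.endswith("'s") and rank > possessive_cutoff)

print(to_ascii("caf\u00e9"))        # cafe
print(keep_token("dog's", 5000))    # False
print(keep_token("it's", 10))       # True
```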
Dan Wheeler
31d35d88d0 build_frequency_lists.py overhaul 2015-11-08 23:52:38 -08:00
Dan Wheeler
bc1f782a2d pull tokenization/normalization/counting logic into a separate file for each raw dictionary source 2015-11-08 17:11:49 -08:00
Dan Wheeler
81ed2c4d55 speed up builds by joining dictionaries in frequency_lists.coffee into single strings

coffee compilation now takes about 1s instead of 20;
full `npm run build` takes about 2-3s instead of 30.
2015-08-16 23:39:37 -07:00
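
One way a generator script might emit a dictionary as a single joined string is sketched below. The function name `render_joined` and the exact output shape are assumptions for illustration, not the repository's confirmed format. The idea is that a compiler has one long string literal to parse instead of tens of thousands of array elements, and the list is rebuilt at load time with a split.

```python
def render_joined(name, tokens):
    """Emit one source line of the form: name: "a,b,c".split(",")"""
    # The separator, quotes, and backslashes would corrupt the literal,
    # so reject them rather than guess at an escaping scheme.
    if any("," in t or '"' in t or "\\" in t for t in tokens):
        raise ValueError("tokens must not contain commas, quotes or backslashes")
    return '{}: "{}".split(",")'.format(name, ",".join(tokens))

print(render_joined("passwords", ["123456", "password", "qwerty"]))
# passwords: "123456,password,qwerty".split(",")
```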
Dan Wheeler
e20e715093 naming: update for data-scripts 2015-08-07 14:31:13 -07:00
Dan Wheeler
c069580ae7 directory cleanup! .coffee in src, compiled/bundled/minified .js in lib 2015-08-05 11:24:02 -07:00