Apart from being simpler to use, this wrapper supports appending ‘//BOM’ to the charset name to either consume or produce a byte order mark.
It also converts invalid byte sequences to (ASCII) escape codes, e.g. \x8F.
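As a rough illustration of the two behaviors above (not the wrapper's actual code), here is a minimal Ruby sketch. The helper names `parse_charset`, `escape_invalid` and `decode_utf8` are hypothetical, and the sketch only handles UTF-8 input.

```ruby
# Hypothetical helpers for illustration only; not the wrapper's API.
BOM = "\uFEFF".freeze

# Split a charset name such as "UTF-8//BOM" into ["UTF-8", true].
def parse_charset(name)
  charset, *flags = name.split('//')
  [charset, flags.include?('BOM')]
end

# Replace every invalid byte with an ASCII escape code such as \x8F.
def escape_invalid(text)
  text.scrub { |bytes| bytes.bytes.map { |b| format('\\x%02X', b) }.join }
end

# Decode UTF-8 input: escape invalid bytes and, if requested, consume a leading BOM.
def decode_utf8(bytes, consume_bom)
  text = escape_invalid(bytes.dup.force_encoding('UTF-8'))
  consume_bom ? text.delete_prefix(BOM) : text
end
```

Calling `decode_utf8(data, parse_charset('UTF-8//BOM').last)` would drop a leading BOM and leave any invalid bytes visible as escape codes instead of failing the conversion.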
Previously, calling the function with invalid UTF-8 could make it iterate over all the data and, in a debug build, trigger an assertion failure.
Now we return the sequence’s end when the data appears to be malformed, and we never look at more than the last 6 bytes in the sequence.
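The bounded backward scan can be sketched as follows. This is a Ruby illustration under my own assumptions, not the actual function: `safe_end` is a hypothetical name, and it assumes the goal is to find where a trailing incomplete multi-byte sequence starts.

```ruby
# Hypothetical sketch: the original UTF-8 design allowed sequences of up to 6 bytes.
MAX_SEQUENCE_LENGTH = 6

# Returns the index at which the byte array can be cut without splitting a
# character. Returns bytes.size (the end) when the tail is complete or malformed.
def safe_end(bytes)
  last = bytes.size
  limit = [last, MAX_SEQUENCE_LENGTH].min
  1.upto(limit) do |back|
    byte = bytes[last - back]
    next if byte & 0xC0 == 0x80 # continuation byte: keep walking backwards
    expected = case byte
               when 0xC0..0xDF then 2
               when 0xE0..0xEF then 3
               when 0xF0..0xF7 then 4
               when 0xF8..0xFB then 5
               when 0xFC..0xFD then 6
               else return last # ASCII or invalid lead byte: nothing to hold back
               end
    # Fewer bytes than the lead byte announces means the sequence is incomplete.
    return back < expected ? last - back : last
  end
  last # only continuation bytes within the limit: treat as malformed
end
```

The loop runs at most six times regardless of input size, so a long run of stray continuation bytes can no longer cause a scan of the entire buffer.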
This checks if the character needs to be counted as double-width (for soft wrap and similar).
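A lookup against tables shaped like `Fixed`, `RangeBegin` and `RangeEnd` could look roughly like this Ruby sketch. The `wide?` name is hypothetical, and the table contents are a few sample entries I picked for illustration, not the full generated data.

```ruby
# Sample entries only; the real tables are generated from EastAsianWidth.txt.
FIXED       = [0x2329, 0x232A].freeze
RANGE_BEGIN = [0x1100, 0x4E00].freeze
RANGE_END   = [0x115F, 0x9FFF].freeze

# Hypothetical check: true if the code point should count as double-width.
def wide?(code_point)
  return true if FIXED.include?(code_point)

  # Find the first range that starts after the code point, then step back one.
  after = RANGE_BEGIN.bsearch_index { |first| first > code_point } || RANGE_BEGIN.size
  index = after - 1
  index >= 0 && code_point <= RANGE_END[index]
end
```

The binary search assumes `RANGE_BEGIN` is sorted and that the ranges do not overlap, which holds when the entries are kept in file order.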
I used the following script to generate the tables; it should be improved to collapse the ranges:
#!/usr/bin/ruby
# Collect the wide (W) code points from EastAsianWidth.txt and print them as C tables.
fixed, start, stop = [ ], [ ], [ ]
IO.popen(%w[curl -Ls http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt]) do |io|
  io.each_line do |line|
    # Match "XXXX;W" or "XXXX..YYYY;W"; the dots are literal and spaces around ';' are allowed.
    next unless line =~ /^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*[A-Za-z]*W/
    if $2
      start << "0x#$1"
      stop  << "0x#$2"
    else
      fixed << "0x#$1"
    end
  end
end
puts "static uint32_t Fixed[] = { #{fixed.join(', ')} };"
puts "static uint32_t RangeBegin[] = { #{start.join(', ')} };"
puts "static uint32_t RangeEnd[] = { #{stop.join(', ')} };"
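Collapsing the ranges, as mentioned above, might be done along these lines. This is only a sketch: `collapse` is a hypothetical helper, and it assumes single code points are first turned into one-element ranges such as `[0x2329, 0x2329]`.

```ruby
# Hypothetical helper: merge overlapping or directly adjacent [first, last] pairs.
def collapse(ranges)
  ranges.sort.each_with_object([]) do |(first, last), merged|
    if merged.any? && first <= merged.last[1] + 1
      # Touches or overlaps the previous range: extend it.
      merged.last[1] = [merged.last[1], last].max
    else
      merged << [first, last]
    end
  end
end
```

With the singles folded in this way, the `Fixed` table would no longer be needed and only the two range tables would have to be emitted.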