12 Commits

Author SHA1 Message Date
Allan Odgaard
0520e4fe88 Add text::transcode_t which is a wrapper for iconv
Apart from being simpler to use, this wrapper supports adding ‘//BOM’ to the charset name to either consume or produce a byte order mark.

It also converts invalid byte sequences to (ASCII) escape codes, e.g. \x8F.
2016-06-21 18:31:29 +02:00
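The two behaviours described in this commit message can be illustrated without touching iconv itself. The sketch below is not the code from the commit: `split_charset` and `escape_byte` are hypothetical helper names, and it assumes the ‘//BOM’ marker only ever appears as a trailing suffix of the charset name.

```cpp
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical helper: split "UTF-16//BOM" into the plain charset name and a
// flag telling the caller to consume or produce a byte order mark.
std::pair<std::string, bool> split_charset(std::string const& name)
{
    static std::string const suffix = "//BOM";
    if (name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0)
        return { name.substr(0, name.size() - suffix.size()), true };
    return { name, false };
}

// Hypothetical helper: render one invalid byte as an ASCII escape such as \x8F.
std::string escape_byte(unsigned char byte)
{
    char buf[5]; // backslash, 'x', two hex digits, terminating NUL
    std::snprintf(buf, sizeof(buf), "\\x%02X", static_cast<unsigned>(byte));
    return buf;
}
```

A wrapper along these lines would pass only the first element of the pair to `iconv_open` and handle the byte order mark itself.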
Allan Odgaard
7675aeb4ec Changing case would truncate the result if it grew in size 2014-06-28 17:42:22 +02:00
Allan Odgaard
cf452cdcee Increase the number of tests for sanitizing UTF-8
Also harmonize the formatting of the existing tests.
2014-04-01 16:01:19 +07:00
Allan Odgaard
1840f5b7fa Improve utf8::find_safe_end implementation
Previously calling the function with invalid UTF-8 could cause it to iterate over all the data and, if built in debug mode, could cause an assertion failure.

Now we return the sequence’s end when the data appears to be malformed and we never look at more than the last 6 bytes in the sequence.
2014-03-03 13:48:12 +07:00
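A minimal sketch of the behaviour described above, assuming the function takes a buffer and returns the offset at which it can be cut without splitting a multibyte sequence. This is an illustration, not the `utf8::find_safe_end` source; the signature is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: return the largest offset at which `buf` can be cut
// without ending in the middle of a UTF-8 sequence.
std::size_t find_safe_end(std::string const& buf)
{
    std::size_t const len   = buf.size();
    std::size_t const limit = len < 6 ? len : 6; // inspect at most the last 6 bytes

    for (std::size_t back = 1; back <= limit; ++back)
    {
        unsigned char const ch = buf[len - back];
        if ((ch & 0xC0) == 0x80) // continuation byte: keep looking for its lead byte
            continue;

        // Sequence length announced by the lead byte (0 for ASCII and invalid bytes).
        std::size_t const n =
            (ch & 0xE0) == 0xC0 ? 2 : (ch & 0xF0) == 0xE0 ? 3 :
            (ch & 0xF8) == 0xF0 ? 4 : (ch & 0xFC) == 0xF8 ? 5 :
            (ch & 0xFE) == 0xFC ? 6 : 0;

        // Incomplete sequence: cut before its lead byte. Otherwise the tail is
        // either complete or malformed, and we return the end of the data.
        return back < n ? len - back : len;
    }
    return len; // empty input, or only continuation bytes within the window
}
```

The loop bound is what keeps malformed input from causing a scan over the whole buffer.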
Allan Odgaard
c2397484b8 Use C++11 for loop
The majority of the edits were done using the following Ruby script:

    def update_loops(src)
      dst, cnt = '', 0

      block_indent, variable = nil, nil
      src.each_line do |line|
        if block_indent
          if line =~ /^#{block_indent}([{}\t])|^\t*$/
            block_indent = nil if $1 == '}'
            line = line.gsub(%r{ ([^a-z>]) \(\*#{variable}\) | \*#{variable}\b | \b#{variable}(->) }x) do
              $1.to_s + variable + ($2 == "->" ? "." : "")
            end
          else
            block_indent = nil
          end
        elsif line =~ /^(\t*)c?iterate\((\w+), (?!diacritics::make_range)(.*\))$/
          block_indent, variable = $1, $2
          line = "#$1for(auto const& #$2 : #$3\n"
          cnt += 1
        end
        dst << line
      end
      return dst, cnt
    end

    paths.each do |path|
      src = IO.read(path)

      cnt = 1
      while cnt != 0
        src, cnt = update_loops(src)
        STDERR << "#{path}: #{cnt}\n"
      end

      File.open(path, "w") { |io| io << src }
    end
2014-03-03 10:34:13 +07:00
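To illustrate the kind of rewrite the script performs, here is a before/after pair. The `iterate` macro definition below is an assumed shape for illustration only, since the log does not show the real one; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed shape of the old macro (hypothetical): the loop variable is an iterator.
#define iterate(it, c) for(auto it = (c).begin(); it != (c).end(); ++it)

// Before: the variable is an iterator, so members are reached through `->`.
std::size_t total_before(std::vector<std::string> const& v)
{
    std::size_t n = 0;
    iterate(str, v)
        n += str->size();
    return n;
}

// After: the variable is a const reference, so `->` becomes `.` and `*str` becomes `str`.
std::size_t total_after(std::vector<std::string> const& v)
{
    std::size_t n = 0;
    for(auto const& str : v)
        n += str.size();
    return n;
}
```

This is why the script also patches the loop body: every dereference of the old iterator has to be turned into a plain use of the reference.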
Allan Odgaard
2fa5d7ddb2 Add UTF-8 sanitization function
This can be used to remove malformed multibyte sequences.
2013-10-08 21:59:54 +02:00
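A minimal sketch of what removing malformed multibyte sequences can look like. It is not the function added by this commit: the name `sanitize` is an assumption, and the check is structural only (a lead byte followed by the right number of continuation bytes), so overlong encodings and surrogates are not rejected here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: copy well-formed UTF-8 sequences and drop everything else.
std::string sanitize(std::string const& src)
{
    std::string dst;
    std::size_t i = 0;
    while (i < src.size())
    {
        unsigned char const ch = src[i];
        // Expected sequence length, or 0 for a stray continuation or invalid lead byte.
        std::size_t const n =
            ch < 0x80           ? 1 : (ch & 0xE0) == 0xC0 ? 2 :
            (ch & 0xF0) == 0xE0 ? 3 : (ch & 0xF8) == 0xF0 ? 4 : 0;

        bool ok = n != 0 && i + n <= src.size();
        for (std::size_t j = 1; ok && j < n; ++j)
            ok = (static_cast<unsigned char>(src[i + j]) & 0xC0) == 0x80;

        if (ok)
        {
            dst.append(src, i, n); // keep the whole sequence
            i += n;
        }
        else
        {
            ++i; // drop the offending byte and resynchronize on the next one
        }
    }
    return dst;
}
```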
Allan Odgaard
b7bc35ed9d Let decode::url_part convert plus to space 2013-08-29 13:26:16 +02:00
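As an illustration of the behaviour named in this commit, the sketch below decodes percent escapes and additionally maps ‘+’ to a space. The function name `decode_url_part` and its signature are assumptions, not the actual `decode::url_part` source.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: decode %XX escapes and convert '+' to a space.
std::string decode_url_part(std::string const& src)
{
    std::string dst;
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        // The two characters after '%', or 0 when fewer than two remain.
        unsigned char const hi = i + 2 < src.size() ? src[i + 1] : 0;
        unsigned char const lo = i + 2 < src.size() ? src[i + 2] : 0;

        if (src[i] == '+')
        {
            dst += ' ';
        }
        else if (src[i] == '%' && std::isxdigit(hi) && std::isxdigit(lo))
        {
            dst += static_cast<char>(std::stoi(src.substr(i + 1, 2), nullptr, 16));
            i += 2; // skip the two hex digits just consumed
        }
        else
        {
            dst += src[i]; // literal character, including a malformed '%'
        }
    }
    return dst;
}
```

A literal plus sign must then be sent as `%2B`, which the sketch decodes back to ‘+’.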
Allan Odgaard
f05426378c Update testing system for text framework 2013-07-26 13:53:58 +02:00
Allan Odgaard
20378c426e A full match should rank highest 2013-01-18 13:34:57 +01:00
Allan Odgaard
ebab500ba3 Use std::map/set instead of C arrays
These types come with a find() method and avoid having to use helper functions to get the begin/end of the array (for linear search).
2012-09-20 12:22:20 +02:00
Allan Odgaard
45f847d01e Add text::is_east_asian_width
This checks if the character needs to be counted as double-width (for soft wrap and similar).

I used the following script to generate the tables; it should be improved to collapse the ranges:

    #!/usr/bin/ruby

    fixed, start, stop = [ ], [ ], [ ]
    open('|curl -Ls http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt') do |io|
      io.grep(/^([0-9A-F]+)(?:\.\.([0-9A-F]+))?;[A-Za-z]*W/) do
        if $2
          start << "0x#$1"
          stop << "0x#$2"
        else
          fixed << "0x#$1"
        end
      end
    end
    puts "static uint32_t Fixed[]      = { #{fixed.join(', ')} };\n"
    puts "static uint32_t RangeBegin[] = { #{start.join(', ')} };\n"
    puts "static uint32_t RangeEnd[]   = { #{stop.join(', ')} };\n"
2012-08-18 21:29:05 +02:00
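The collapsing mentioned in the message could look roughly like the sketch below, which merges single code points and adjacent ranges into one sorted table and then answers lookups with a binary search. The names `range_t`, `collapse`, and `contains` are hypothetical, and any values fed to it in examples are placeholders rather than real East Asian Width data.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

using range_t = std::pair<uint32_t, uint32_t>; // inclusive [first, second]

// Merge overlapping or directly adjacent ranges; single code points are
// passed in as ranges with first == second.
std::vector<range_t> collapse(std::vector<range_t> ranges)
{
    std::sort(ranges.begin(), ranges.end());
    std::vector<range_t> res;
    for (auto const& r : ranges)
    {
        // 64-bit arithmetic so that `second + 1` cannot wrap around.
        if (!res.empty() && uint64_t(r.first) <= uint64_t(res.back().second) + 1)
            res.back().second = std::max(res.back().second, r.second);
        else
            res.push_back(r);
    }
    return res;
}

// Binary search in a collapsed (sorted, non-overlapping) table.
bool contains(std::vector<range_t> const& ranges, uint32_t ch)
{
    auto it = std::upper_bound(ranges.begin(), ranges.end(), ch,
        [](uint32_t value, range_t const& r) { return value < r.first; });
    return it != ranges.begin() && ch <= std::prev(it)->second;
}
```

With a collapsed table, the separate fixed and range arrays printed by the script reduce to a single array of pairs.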
Allan Odgaard
9894969e67 Initial commit 2012-08-09 16:25:56 +02:00