25 Commits

Author SHA1 Message Date
Allan Odgaard
7d0100fa2b utf16::distance: End iterator may point into multi-byte character
This function is (indirectly) used by a lot of code, and not all of it provides valid indexes. It seems like an issue that can be fixed locally though, which is why I have decided to allow it (coupled with this being the main reason for crashes).

It is however still not allowed when building in debug mode (rationale being that running it in debug mode and getting an assertion failure should provide enough info to fix the issue).
2014-04-05 14:13:39 +07:00
Allan Odgaard
0daa6d0ec2 Tighter code for removing malformed UTF-8 sequences 2014-04-01 16:01:19 +07:00
Allan Odgaard
cf452cdcee Increase the number of tests for sanitizing UTF-8
Also harmonize the formatting of the existing tests.
2014-04-01 16:01:19 +07:00
Allan Odgaard
471fbe45c2 Do not stack allocate potentially large buffer
Also test that each system function used actually succeeds.
2014-04-01 16:01:19 +07:00
Allan Odgaard
d7660bd89e Detect first loop iteration using std::exchange “idiom” 2014-03-23 22:47:15 +07:00
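
For readers unfamiliar with the idiom named in this commit, here is a minimal sketch of one common form of it. The `join` helper and its parameters are hypothetical and not taken from the repository; `std::exchange` itself is the standard C++14 function from `<utility>`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration: joins items with a separator.
std::string join(std::vector<std::string> const& items, std::string const& sep)
{
    std::string res;
    bool first = true;
    for(auto const& item : items)
    {
        // std::exchange stores false in `first` and returns the old value,
        // so the old value is true only during the first iteration.
        if(!std::exchange(first, false))
            res += sep;
        res += item;
    }
    return res;
}
```

The flag is tested and cleared in a single expression, which avoids a separate `first = false;` statement inside the loop body.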
Allan Odgaard
f3f4efd062 Use binary literals in code (C++14) 2014-03-16 18:06:03 +07:00
Allan Odgaard
1840f5b7fa Improve utf8::find_safe_end implementation
Previously calling the function with invalid UTF-8 could cause it to iterate over all the data and, if built in debug mode, could cause an assertion failure.

Now we return the sequence’s end when the data appears to be malformed and we never look at more than the last 6 bytes in the sequence.
2014-03-03 13:48:12 +07:00
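
The behavior described in this commit message could be sketched roughly as follows. This is an assumption about the approach, not the actual `utf8::find_safe_end` code: the signature and the `sequence_length` helper are hypothetical. The sketch walks backwards over at most 6 bytes, returns the position of a truncated lead byte, and returns the end of the sequence whenever the tail looks complete or malformed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: expected sequence length for a non-continuation byte,
// or 0 when the byte can never start a sequence (0xFE, 0xFF).
static std::size_t sequence_length(unsigned char ch)
{
    if(ch < 0x80)           return 1;
    if((ch & 0xE0) == 0xC0) return 2;
    if((ch & 0xF0) == 0xE0) return 3;
    if((ch & 0xF8) == 0xF0) return 4;
    if((ch & 0xFC) == 0xF8) return 5;
    if((ch & 0xFE) == 0xFC) return 6;
    return 0;
}

// Sketch only: returns the largest position that does not split a sequence.
char const* find_safe_end(char const* first, char const* last)
{
    std::size_t seen = 0;
    for(char const* it = last; it != first && seen < 6; )
    {
        unsigned char ch = static_cast<unsigned char>(*--it);
        ++seen;
        if((ch & 0xC0) == 0x80)
            continue; // continuation byte, keep looking for the lead byte

        // Truncated sequence: cut before the lead byte.
        // Complete or malformed tail: keep everything.
        return sequence_length(ch) > seen ? it : last;
    }
    return last; // only continuation bytes seen, treat as malformed
}
```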
Allan Odgaard
c2397484b8 Use C++11 for loop
The majority of the edits were done using the following Ruby script:

    def update_loops(src)
      dst, cnt = '', 0

      block_indent, variable = nil, nil
      src.each_line do |line|
        if block_indent
          if line =~ /^#{block_indent}([{}\t])|^\t*$/
            block_indent = nil if $1 == '}'
            line = line.gsub(%r{ ([^a-z>]) \(\*#{variable}\) | \*#{variable}\b | \b#{variable}(->) }x) do
              $1.to_s + variable + ($2 == "->" ? "." : "")
            end
          else
            block_indent = nil
          end
        elsif line =~ /^(\t*)c?iterate\((\w+), (?!diacritics::make_range)(.*\))$/
          block_indent, variable = $1, $2
          line = "#$1for(auto const& #$2 : #$3\n"
          cnt += 1
        end
        dst << line
      end
      return dst, cnt
    end

    paths.each do |path|
      src = IO.read(path)

      cnt = 1
      while cnt != 0
        src, cnt = update_loops(src)
        STDERR << "#{path}: #{cnt}\n"
      end

      File.open(path, "w") { |io| io << src }
    end
2014-03-03 10:34:13 +07:00
Allan Odgaard
bc4650f2b0 Move line ending support to text framework 2013-10-31 18:32:16 +01:00
Allan Odgaard
2fa5d7ddb2 Add UTF-8 sanitization function
This can be used to remove malformed multibyte sequences.
2013-10-08 21:59:54 +02:00
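
As an illustration of what removing malformed multibyte sequences can involve, here is a minimal sketch. It is not the function added by this commit: the name `sanitize` and its signature are assumptions, and the sketch only checks lead and continuation bytes for sequences of up to 4 bytes (it does not reject overlong encodings or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: copies well-formed sequences and drops stray bytes.
std::string sanitize(std::string const& src)
{
    std::string dst;
    std::size_t i = 0, n = src.size();
    while(i < n)
    {
        unsigned char ch = static_cast<unsigned char>(src[i]);

        // Expected length from the lead byte, 0 if it cannot start a sequence.
        std::size_t len = ch < 0x80 ? 1
                        : (ch & 0xE0) == 0xC0 ? 2
                        : (ch & 0xF0) == 0xE0 ? 3
                        : (ch & 0xF8) == 0xF0 ? 4 : 0;

        // The sequence must fit and be followed by len-1 continuation bytes.
        bool ok = len != 0 && i + len <= n;
        for(std::size_t j = 1; ok && j < len; ++j)
            ok = (static_cast<unsigned char>(src[i + j]) & 0xC0) == 0x80;

        if(ok)
        {
            dst.append(src, i, len);
            i += len;
        }
        else
        {
            ++i; // drop one byte and resynchronize on the next
        }
    }
    return dst;
}
```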
Allan Odgaard
1c308c810d Use map::emplace instead of inserting std::pair (C++11) 2013-09-05 20:59:11 +02:00
Allan Odgaard
b7bc35ed9d Let decode::url_part convert plus to space 2013-08-29 13:26:16 +02:00
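
A rough sketch of URL-part decoding with the plus-to-space conversion named in this commit. The function name `url_part_decode` and the `hex_value` helper are hypothetical; the helper is a portable stand-in, since `digittoint()` is a BSD extension and not available everywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical portable stand-in for a hex digit conversion; -1 if not hex.
static int hex_value(char ch)
{
    if('0' <= ch && ch <= '9') return ch - '0';
    if('a' <= ch && ch <= 'f') return ch - 'a' + 10;
    if('A' <= ch && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Sketch only: decodes %XX escapes and converts '+' to a space.
std::string url_part_decode(std::string const& src)
{
    std::string dst;
    for(std::size_t i = 0; i < src.size(); ++i)
    {
        if(src[i] == '+')
        {
            dst += ' ';
        }
        else if(src[i] == '%' && i + 2 < src.size()
                && hex_value(src[i + 1]) >= 0 && hex_value(src[i + 2]) >= 0)
        {
            dst += static_cast<char>(hex_value(src[i + 1]) * 16 + hex_value(src[i + 2]));
            i += 2; // skip the two hex digits
        }
        else
        {
            dst += src[i]; // copy everything else, including a lone '%'
        }
    }
    return dst;
}
```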
Allan Odgaard
7ccd7add60 Use digittoint() instead of std::stoi()
Both because of performance and because the latter can throw an exception (although we check the input, so it should not happen with our use of the API).
2013-08-27 15:30:09 +02:00
Allan Odgaard
585a32344a Allow comparison of text::indent_t 2013-07-29 10:03:25 +02:00
Allan Odgaard
f05426378c Update testing system for text framework 2013-07-26 13:53:58 +02:00
Allan Odgaard
ea2cf8d875 Add CR to default trim character set 2013-06-22 21:02:45 +07:00
Allan Odgaard
fd60fd25c7 Change strtol → std::stol (C++11)
I initially wanted to do this change globally, but std::stoX will throw an exception if it fails to parse something, and we use strtoX in a few places where parsing nothing (and getting back zero) is fine.
2013-02-08 11:20:35 +01:00
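
The difference this commit message describes can be seen in a short sketch. The two wrapper functions are hypothetical names used only for illustration; the behavior of `std::strtol` and `std::stol` shown here is standard.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: strtol yields 0 when nothing can be parsed.
long parse_lenient(std::string const& str)
{
    return std::strtol(str.c_str(), nullptr, 10);
}

// Hypothetical wrapper: stol throws, so failure has to be caught explicitly.
bool parse_strict(std::string const& str, long& out)
{
    try
    {
        out = std::stol(str);
        return true;
    }
    catch(std::invalid_argument const&) { return false; } // no digits found
    catch(std::out_of_range const&)     { return false; } // value does not fit
}
```

Code that relies on getting zero back for empty or non-numeric input would need a `try` block like the one above after switching to `std::stol`.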
Allan Odgaard
e75e7ec8e5 Change text::format → std::to_string (C++11) 2013-02-08 11:20:34 +01:00
Allan Odgaard
20378c426e A full match should rank highest 2013-01-18 13:34:57 +01:00
Allan Odgaard
ebab500ba3 Use std::map/set instead of C arrays
These types come with a find() method and avoid having to use helper functions to get the begin/end of the array (for linear search).
2012-09-20 12:22:20 +02:00
Allan Odgaard
39f0ea518b Use tuple instead of lexicographical_compare 2012-09-20 12:22:20 +02:00
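
A minimal sketch of the general technique this commit names: comparing several fields in order by building tuples of references with `std::tie`, whose `operator<` compares lexicographically. The `rank_t` type and its members are hypothetical and not taken from the repository.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Hypothetical record with three fields compared in order.
struct rank_t
{
    int score;
    int length;
    std::string name;
};

bool operator<(rank_t const& lhs, rank_t const& rhs)
{
    // std::tie builds tuples of references; tuple comparison is
    // lexicographical, so no manual field-by-field chain is needed.
    return std::tie(lhs.score, lhs.length, lhs.name)
         < std::tie(rhs.score, rhs.length, rhs.name);
}
```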
Allan Odgaard
cbe91ff831 Assume compiler support for explicit keyword
Since we require a fairly recent clang for other features, there is no reason to test for this one.
2012-08-29 14:27:35 +02:00
Allan Odgaard
be63bda3e7 Support East Asian Width
There are a number of functions that deal with the logical column count, and these now all count code points that have the “East Asian Width” (Unicode) property set as two columns.

This closes issue #206.
2012-08-18 21:29:05 +02:00
Allan Odgaard
45f847d01e Add text::is_east_asian_width
This checks if the character needs to be counted as double-width (for soft wrap and similar).

I used the following script to generate the tables; it should be improved to collapse the ranges:

    #!/usr/bin/ruby

    fixed, start, stop = [ ], [ ], [ ]
    open('|curl -Ls http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt') do |io|
      io.grep(/^([0-9A-F]+)(?:\.\.([0-9A-F]+))?;[A-Za-z]*W/) do
        if $2
          start << "0x#$1"
          stop << "0x#$2"
        else
          fixed << "0x#$1"
        end
      end
    end
    puts "static uint32_t Fixed[]      = { #{fixed.join(', ')} };\n"
    puts "static uint32_t RangeBegin[] = { #{start.join(', ')} };\n"
    puts "static uint32_t RangeEnd[]   = { #{stop.join(', ')} };\n"
2012-08-18 21:29:05 +02:00
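
To show how range tables of this shape might be consulted, here is a hedged sketch. It is not the repository's implementation: the tables below are a small hand-picked subset of wide ranges for illustration (not the generated output), the lookup strategy is an assumption, and `column_count` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Illustrative subset only, sorted by range start.
static std::uint32_t const RangeBegin[] = { 0x1100, 0x3041, 0x4E00, 0xAC00 };
static std::uint32_t const RangeEnd[]   = { 0x115F, 0x3096, 0x9FFF, 0xD7A3 };

// Sketch: binary search for the last range that starts at or before ch.
bool is_east_asian_width(std::uint32_t ch)
{
    auto it = std::upper_bound(std::begin(RangeBegin), std::end(RangeBegin), ch);
    if(it == std::begin(RangeBegin))
        return false; // ch is below the first range
    std::size_t index = static_cast<std::size_t>(it - std::begin(RangeBegin)) - 1;
    return ch <= RangeEnd[index];
}

// Hypothetical helper: wide code points count as two columns.
std::size_t column_count(std::u32string const& text)
{
    std::size_t cols = 0;
    for(char32_t ch : text)
        cols += is_east_asian_width(ch) ? 2 : 1;
    return cols;
}
```

Single code points could be stored as one-element ranges in the same tables, so that one lookup covers both cases.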
Allan Odgaard
9894969e67 Initial commit 2012-08-09 16:25:56 +02:00