7.1.3 UTF-16 String Functions | 株式会社きじねこ

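In this installment we define functions that read and write a single Unicode character (more precisely, a code point), and a function that computes the length of a UTF-16 string. As in the previous installments, we assume C++11 or later and an implementation in which int is 32 bits wide.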

First, we define the constants that the functions need.


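#include <cstddef>    // std::size_t
#include <stdexcept>  // std::out_of_range, std::invalid_argument

// Maximum value of a single UTF-16 code unit
constexpr char16_t utf16_max = 0xffff;
// Maximum Unicode code point
constexpr char32_t unicode_max = 0x10ffff;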

With these constants in place, we now define each function in turn.

Writing one Unicode character as a UTF-16 sequence

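This function writes one Unicode character to an OutputIterator. Strictly speaking, the OutputIterator's value_type ought to be char16_t, but we deliberately do not check it, so that the function also works with unsigned short, wchar_t, and similar types.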


template <typename OutputIterator>
OutputIterator utf16_putchar(char32_t c, OutputIterator first, OutputIterator last)
{
  // No room left in the output range
  if (first == last)
    throw std::out_of_range(__func__);
  if (c <= utf16_max)
  {
    // Fits in one code unit (surrogate code points are not rejected here)
    *first++ = static_cast<char16_t>(c);
  }
  else if (c <= unicode_max)
  {
    // Above U+FFFF: write the high surrogate, then the low surrogate
    *first++ = static_cast<char16_t>((c - 0x10000) / 0x400 + 0xd800);
    if (first == last)
      throw std::out_of_range(__func__);
    *first++ = static_cast<char16_t>((c - 0x10000) % 0x400 + 0xdc00);
  }
  else
  {
    // Beyond U+10FFFF: not a Unicode code point
    throw std::invalid_argument(__func__);
  }
  return first;
}
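
To make the surrogate-pair arithmetic in the second branch concrete, here is a minimal sketch that isolates the two expressions and checks them against a worked example. The names high_surrogate_of, low_surrogate_of, and check_encode_boundaries are hypothetical helpers introduced only for this illustration; they are not part of the article's code, and they assume the argument lies between 0x10000 and 0x10ffff.

```cpp
#include <cassert>

// Hypothetical helpers (not in the article): the two halves of the
// surrogate-pair formula from utf16_putchar, written as C++11 constexpr
// functions. Both assume 0x10000 <= c <= 0x10ffff.
constexpr char16_t high_surrogate_of(char32_t c)
{
  return static_cast<char16_t>((c - 0x10000) / 0x400 + 0xd800);
}

constexpr char16_t low_surrogate_of(char32_t c)
{
  return static_cast<char16_t>((c - 0x10000) % 0x400 + 0xdc00);
}

// Worked example for U+1F600:
//   0x1F600 - 0x10000 = 0xF600
//   0xF600 / 0x400 = 0x3D   ->  0xD800 + 0x3D  = 0xD83D
//   0xF600 % 0x400 = 0x200  ->  0xDC00 + 0x200 = 0xDE00
static_assert(high_surrogate_of(0x1F600) == 0xD83D, "high half of U+1F600");
static_assert(low_surrogate_of(0x1F600) == 0xDE00, "low half of U+1F600");

// Runtime check of both ends of the range that needs a surrogate pair.
void check_encode_boundaries()
{
  assert(high_surrogate_of(0x10000) == 0xD800);
  assert(low_surrogate_of(0x10000) == 0xDC00);
  assert(high_surrogate_of(0x10FFFF) == 0xDBFF);
  assert(low_surrogate_of(0x10FFFF) == 0xDFFF);
}
```

Dividing by 0x400 selects the upper 10 bits of the 20-bit offset and the remainder selects the lower 10 bits, which is why each half always lands inside its own surrogate range.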

Reading one Unicode character from a UTF-16 sequence

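Next comes the reverse operation: reading one Unicode character from an InputIterator. As before, the InputIterator's value_type is deliberately left unchecked. The code calls is_high_surrogate and is_low_surrogate, which are not defined in this section.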


template <typename InputIterator>
char32_t utf16_getchar(InputIterator& next, InputIterator last)
{
  // Nothing left to read
  if (next == last)
    throw std::invalid_argument(__func__);
  char32_t c = *next++;
  auto h = c;
  if (is_high_surrogate(h))
  {
    // A high surrogate must be followed by a low surrogate
    if (next == last)
      throw std::invalid_argument(__func__);

    auto l = *next++;
    if (!is_low_surrogate(l))
      throw std::invalid_argument(__func__);

    // Combine the pair into a single code point above U+FFFF
    c = 0x10000 + (h - 0xd800) * 0x400 + (l - 0xdc00);
    if (c <= utf16_max || unicode_max < c)
      throw std::invalid_argument(__func__);
  }
  // On return, next points just past the code units that were consumed
  return c;
}
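
The combining step can be checked in isolation. The sketch below is an illustration under stated assumptions: the bodies of is_high_surrogate and is_low_surrogate are guesses at what the article's helpers do (the usual ranges 0xD800 to 0xDBFF and 0xDC00 to 0xDFFF), and combine_surrogates and check_decode_examples are hypothetical names that do not appear in the article.

```cpp
#include <cassert>

// Assumed definitions: the article calls these predicates but does not
// show them in this section.
constexpr bool is_high_surrogate(char32_t c)
{
  return 0xd800 <= c && c <= 0xdbff;
}

constexpr bool is_low_surrogate(char32_t c)
{
  return 0xdc00 <= c && c <= 0xdfff;
}

// Hypothetical helper: the combining expression of utf16_getchar alone.
constexpr char32_t combine_surrogates(char32_t h, char32_t l)
{
  return 0x10000 + (h - 0xd800) * 0x400 + (l - 0xdc00);
}

// Worked example for the pair 0xD83D 0xDE00:
//   0xD83D - 0xD800 = 0x3D,  0x3D * 0x400 = 0xF400
//   0xDE00 - 0xDC00 = 0x200
//   0x10000 + 0xF400 + 0x200 = 0x1F600
static_assert(combine_surrogates(0xD83D, 0xDE00) == 0x1F600, "U+1F600");

// Runtime check of the predicates and of the smallest and largest pairs.
void check_decode_examples()
{
  assert(is_high_surrogate(0xD83D) && is_low_surrogate(0xDE00));
  assert(!is_high_surrogate(0xDC00) && !is_low_surrogate(0xDBFF));
  assert(combine_surrogates(0xD800, 0xDC00) == 0x10000);
  assert(combine_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF);
}
```

Under these assumed predicates, any valid pair decodes to a value from 0x10000 through 0x10FFFF, so the final range check in utf16_getchar serves as a defensive guard rather than a condition a valid pair can trigger.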

Computing the length of a UTF-16 string

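Finally, we define a function that computes the length of a UTF-16 string, measured in code points. An unpaired surrogate (one half of a surrogate pair with no partner) is also counted as one code point.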


template <typename InputIterator>
std::size_t utf16_length(InputIterator& next, InputIterator last)
{
  std::size_t r = 0;

  // next is taken by reference and has advanced to last on return
  while (next != last)
  {
    char16_t c = *next++;
    ++r;
    if (is_high_surrogate(c))
    {
      // Consume the low surrogate so that the pair counts as one code point
      if (next != last && is_low_surrogate(*next))
        ++next;
    }
  }
  return r;
}
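
Because next is a non-const reference, the caller has to pass a named iterator rather than a temporary such as s.begin(). The following sketch shows one way to call the function on a std::u16string. It repeats a condensed copy of utf16_length so that the block compiles on its own; the predicate bodies are assumptions (the article does not show them here), and count_code_points is a hypothetical wrapper added only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed definitions; the article uses these predicates without showing them.
constexpr bool is_high_surrogate(char16_t c) { return 0xd800 <= c && c <= 0xdbff; }
constexpr bool is_low_surrogate(char16_t c) { return 0xdc00 <= c && c <= 0xdfff; }

// Condensed copy of utf16_length so that this block is self-contained.
template <typename InputIterator>
std::size_t utf16_length(InputIterator& next, InputIterator last)
{
  std::size_t r = 0;
  while (next != last)
  {
    char16_t c = *next++;
    ++r;
    if (is_high_surrogate(c) && next != last && is_low_surrogate(*next))
      ++next;
  }
  return r;
}

// Hypothetical convenience wrapper (not part of the article).
std::size_t count_code_points(const std::u16string& s)
{
  auto it = s.begin();  // must be an lvalue: utf16_length takes it by reference
  return utf16_length(it, s.end());
}

// Small demonstration: a surrogate pair counts once, and so does a lone half.
void check_length_examples()
{
  assert(count_code_points(u"a\U0001F600b") == 3);   // 4 code units, 3 code points
  const char16_t lone[] = {u'a', 0xD83D, u'b', 0};   // unpaired high surrogate
  assert(count_code_points(lone) == 3);
}
```

Writing utf16_length(s.begin(), s.end()) directly would not compile, since a temporary iterator cannot bind to the InputIterator& parameter.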

About the author

高木信尚
