Proposals for Improved String Interoperability in a Unicode World

Introduction

This paper proposes additions to Boost to ease use of strings in an international environment. The primary focus is making the Unicode features of C++11 easier to use, although other character encodings are also supported.

These changes are the Boost equivalent of the C++ standard library TR2 proposal N3336, Adapting Standard Library Strings and I/O to a Unicode World.

A proof-of-concept Boost implementation is available at github.com/Beman/string-interoperability


Proposed Boost Components

The proposals are presented as a series of problems and their solutions. The higher-level problems are presented before the lower-level problems. The higher-level components are built using some of the proposed lower-level components.

Problem 1: Standard strings don't interoperate if encoding differs

Discussion

Standard library strings of different character types do not interoperate.

Examples

u32string s32;
s32 = "foo";         // error!
u16string s16(s32);  // error!
wstring ws(s32.begin(), s32.end()); // error!
void f(const string&);
f(s32);  // error!

The encoding of basic_string instantiations can be determined for the types under discussion. It is either implicit in the string's value_type or can be determined via the locale.  See Boost.Filesystem V3 class path for an example of how such interoperability might be achieved.

Experience with Boost.Filesystem V3 class path has demonstrated that string interoperability brings a considerable simplification and improvement to internationalized applications, but that having to provide interoperability without the resolution of the issues presented here is a band-aid. It is being misused, too - users are passing around boost::filesystem::path objects simply to get string encoding interoperability!

Proposed Boost component

The approach is to derive a string class from std::basic_string and add overloads to functions most likely to benefit from interoperability. The overloads are in the form of function templates with sufficient restrictions on overload resolution participation (i.e. enable_if) that the existing standard library functions are always selected if the value type of the argument is the same as or convertible to the std::basic_string type's value_type. The semantics of the added signatures are the same as original signatures except that arguments of the template parameter type have their value converted to the type and encoding of basic_string::value_type.
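The overload restriction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proof-of-concept implementation: the names interop_string and convert_ascii_only are hypothetical, only constructors are shown, and the conversion helper merely copies code units (correct for ASCII only) where a real component would transcode.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for real transcoding: copies code units one by one,
// which is only correct for ASCII content.
template <class ToString, class FromCharT>
ToString convert_ascii_only(const std::basic_string<FromCharT>& from)
{
  ToString to;
  for (FromCharT c : from)
    to.push_back(static_cast<typename ToString::value_type>(c));
  return to;
}

// Hypothetical string class derived from std::basic_string.
template <class CharT>
class interop_string : public std::basic_string<CharT>
{
  typedef std::basic_string<CharT> base;
public:
  // Existing behavior: same value_type, forwarded to std::basic_string.
  interop_string() {}
  interop_string(const CharT* s) : base(s) {}
  interop_string(const base& s) : base(s) {}

  // Added overload: participates in overload resolution only when the
  // argument's value type differs from CharT.
  template <class FromCharT,
            typename std::enable_if<
              !std::is_same<FromCharT, CharT>::value, int>::type = 0>
  interop_string(const std::basic_string<FromCharT>& s)
    : base(convert_ascii_only<base>(s)) {}
};

// Usage: a construction that is an error with plain std::basic_string.
inline void interop_string_example()
{
  std::u32string s32(U"foo");
  interop_string<char16_t> s16(s32);   // selects the added overload
  interop_string<char> s8("foo");      // selects the existing const char* path
  assert(s16 == u"foo");
  assert(s8 == "foo");
}
```

Because the added constructor is disabled when the value types match, arguments of the string's own value type continue to reach the unmodified std::basic_string constructors.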

The std::basic_string functions given additional overloads are:

To keep the number and complexity of overloads manageable, the proof-of-concept implementation does not provide any way to specify error handling policies, or string and wstring encoding. Not every added signature needs to be able to control error handling and encoding. The need is particularly rare in environments where UTF-8 is the narrow character encoding and UTF-16 is the wide character encoding. A subset, possibly just c_str(), begin(), and end(), with error handling and encoding parameters or arguments, suitably defaulted, may well be sufficient.

See <boost/interop/string.hpp>.

Problem 2: Strings don't interoperate with I/O streams if encoding differs

Discussion

A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:

#include <iostream>
int main()
{
  std::cout << U"您好世界";   // error in C++11!
}

This code should "just work", even though U"您好世界" is an array of const char32_t (which decays to const char32_t*), not const char*, as long as the encoding of char supports 您好世界. Even if those characters are not supported by default encodings, alternatives like UTF-8 are available.

The code does "just work" with the proof-of-concept implementation of this proposal. On Linux, with default char encoding of UTF-8, execution produces the expected 您好世界 output. On Windows, the console doesn't support full UTF-8, so the output can be piped to a file or to a program which does handle UTF-8 correctly. And, yes, that does work correctly with the proof-of-concept implementation of this proposal.

Proposed Boost component

Add additional function templates to those in 27.7.3.6.4 [ostream.inserters.character], Character inserter function templates, to cover the case where the argument character type differs from charT and is not char, signed char, unsigned char, const char*, const signed char*, or const unsigned char*.  (The specified types are excluded because they are covered by existing signatures.) The semantics of the added signatures are the same as original signatures except that arguments shall be converted to the type and encoding of the stream.

Do the same for the character extractors in 27.7.2.2.3 [istream::extractors], basic_istream::operator>>.

Do the same for the two std::basic_string inserters and extractors in 21.4.8.9 [string.io], Inserters and extractors.

See <boost/interop/stream.hpp>.
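The conversion such an inserter has to perform can be sketched as below, assuming the stream's char encoding is UTF-8. The names append_utf8 and insert_utf32 are hypothetical, a named function is used rather than an operator<< overload, and replacing invalid code points with U+FFFD is an assumed error policy, not one taken from the proposal.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Append the UTF-8 encoding of one code point. Surrogates and values above
// U+10FFFF are replaced by U+FFFD (an assumed error policy).
inline void append_utf8(std::string& out, char32_t cp)
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    cp = 0xFFFD;
  if (cp < 0x80)
    out += static_cast<char>(cp);
  else if (cp < 0x800)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical named equivalent of an inserter for null-terminated UTF-32
// text: converts to the (assumed UTF-8) encoding of the stream, then inserts.
inline std::ostream& insert_utf32(std::ostream& os, const char32_t* s)
{
  std::string bytes;
  for (; *s != 0; ++s)
    append_utf8(bytes, *s);
  return os << bytes;
}

// Usage: the four characters of the "Hello World" example, written with
// universal character names so the source file encoding does not matter.
inline void insert_utf32_example()
{
  std::ostringstream os;
  insert_utf32(os, U"\u60A8\u597D\u4E16\u754C") << '\n';
  assert(os.str().size() == 13);  // four 3-byte sequences plus the newline
}
```

This sketch buffers the converted bytes in a temporary std::string for brevity. A component built on conversion iterators would not need that temporary.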

Problem 3: String conversion iterators are not provided

Discussion

Conversion between character types and their encodings using current standard library facilities such as std::codecvt, std::locale, and std::wstring_convert has multiple problems:

Example

The generalization of the std::basic_string function c_str is:

template <class T> unspecified_iterator c_str() const;

Given a std::string named s8, this allows a user to write s8.c_str<char16_t>() to obtain an iterator with a value type of char16_t. Implementing this function generically with the current standard library would be difficult and would involve creating a temporary string. The full implementation with the proposed solution is simply:

template <class T>
converting_iterator<const_iterator, value_type, by_range, T> c_str() const
{
  return converting_iterator<const_iterator,
    value_type, by_range, T>(cbegin(), cend());
}

No temporary string is created, and none of the other problems listed above are present either. The solution is generally useful for user defined types, and not just for implementations of the standard library.

Other problems become easier to solve with converting_iterator. For example, the Filesystem library's class path in N3239 has many functions with an argument in the form const codecvt_type& cvt=codecvt() that could be eliminated by either direct or indirect use of converting_iterator.

Existing practice

Boost Regex for many years has included a set of Unicode conversion iterators as an implementation detail. Although these do not provide composition, they do demonstrate the technique of using encoding conversion iterators to avoid creation of temporary strings.

Proposed Boost component

This solution is based on the proof-of-concept implementation. Input iterator requirements can probably be loosened to bidirectional, but that hasn't been tested yet.

The preliminaries begin with end-detection policy classes, since strings use null termination, size, or half-open ranges to determine the end of a sequence.

template <class InputIterator> class by_null;
template <class InputIterator> class by_size;
template <class InputIterator> class by_range;
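One way such policies might look is sketched below. The member functions is_end and advance, and the helper count_elements, are assumptions made for illustration. The proposal only names the three class templates and does not specify their interfaces.

```cpp
#include <cassert>
#include <cstddef>

// by_null: the sequence ends at the first zero-valued element.
template <class InputIterator>
class by_null
{
public:
  bool is_end(InputIterator it) const { return *it == 0; }
  void advance() {}
};

// by_size: the sequence ends after a fixed number of elements.
template <class InputIterator>
class by_size
{
  std::size_t remaining_;
public:
  explicit by_size(std::size_t n) : remaining_(n) {}
  bool is_end(InputIterator) const { return remaining_ == 0; }
  void advance() { --remaining_; }
};

// by_range: the sequence ends when the iterator reaches a stored end.
template <class InputIterator>
class by_range
{
  InputIterator end_;
public:
  explicit by_range(InputIterator end) : end_(end) {}
  bool is_end(InputIterator it) const { return it == end_; }
  void advance() {}
};

// Hypothetical generic algorithm that works with any of the three policies.
template <class InputIterator, class EndPolicy>
std::size_t count_elements(InputIterator it, EndPolicy policy)
{
  std::size_t n = 0;
  for (; !policy.is_end(it); ++it, policy.advance())
    ++n;
  return n;
}

// Usage: the same traversal code with three ways of detecting the end.
inline void end_policy_example()
{
  const char* s = "abc";
  assert(count_elements(s, by_null<const char*>()) == 3);
  assert(count_elements(s, by_size<const char*>(2)) == 2);
  assert(count_elements(s, by_range<const char*>(s + 1)) == 1);
}
```

Factoring end detection into a policy lets a single conversion iterator serve C-style strings, counted buffers, and iterator pairs.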

Codec templates handle actual conversion to and from UTF-32. The primary templates are:

template <class InputIterator, class FromCharT, template<class> class EndPolicy> 
  class to32_iterator;
template <class InputIterator, class ToCharT> 
  class from32_iterator;

The Boost library would provide specializations for char, wchar_t, char16_t, and char32_t. Presumably users could provide specializations for UDTs, but that hasn't been tested yet. The char and wchar_t specializations provide mechanisms to select the encoding. Since this is a new component the char default encoding could be UTF-8 rather than locale based and no existing code would be broken.

The actual converting_iterator primary template is simply:

template <class InputIterator, class FromCharT, template<class> class EndPolicy,
          class ToCharT> 
class converting_iterator
  : public from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>,
      ToCharT>
{
public:
  using from32_iterator::from32_iterator;
};

Specializations may be provided, but aren't required. The proof-of-concept implementation doesn't use inherited constructors because of lack of compiler support.
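The technique of converting on the fly, without a temporary string, can be illustrated with a much-simplified decoding iterator. The sketch below is not the proposed to32_iterator: the name utf8_to32_iterator is hypothetical, it handles only UTF-8 input delimited by a half-open range, and it assumes well-formed input with no error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical, simplified analogue of to32_iterator: an input iterator that
// decodes UTF-8 code units to UTF-32 code points as it is advanced.
template <class InputIterator>
class utf8_to32_iterator
{
  InputIterator pos_, end_;
  char32_t value_;
  bool at_end_;

  // Decode the next code point into value_, or mark the end of the sequence.
  void read()
  {
    if (pos_ == end_) { at_end_ = true; return; }
    unsigned char lead = static_cast<unsigned char>(*pos_++);
    int trailing = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    static const unsigned char mask[] = { 0x7F, 0x1F, 0x0F, 0x07 };
    value_ = lead & mask[trailing];
    for (; trailing > 0 && pos_ != end_; --trailing)
      value_ = (value_ << 6) | (static_cast<unsigned char>(*pos_++) & 0x3F);
  }

public:
  typedef std::input_iterator_tag iterator_category;
  typedef char32_t value_type;
  typedef std::ptrdiff_t difference_type;
  typedef const char32_t* pointer;
  typedef char32_t reference;

  // A default-constructed iterator serves as the end iterator.
  utf8_to32_iterator() : pos_(), end_(), value_(0), at_end_(true) {}
  utf8_to32_iterator(InputIterator first, InputIterator last)
    : pos_(first), end_(last), value_(0), at_end_(false) { read(); }

  char32_t operator*() const { return value_; }
  utf8_to32_iterator& operator++() { read(); return *this; }
  utf8_to32_iterator operator++(int)
  {
    utf8_to32_iterator tmp(*this);
    read();
    return tmp;
  }

  friend bool operator==(const utf8_to32_iterator& a,
                         const utf8_to32_iterator& b)
  {
    return a.at_end_ == b.at_end_ && (a.at_end_ || a.pos_ == b.pos_);
  }
  friend bool operator!=(const utf8_to32_iterator& a,
                         const utf8_to32_iterator& b)
  {
    return !(a == b);
  }
};

// Usage: build a UTF-32 string directly from UTF-8 code units.
inline void utf8_to32_example()
{
  typedef utf8_to32_iterator<std::string::const_iterator> iter;
  std::string s8("z\xE6\x82\xA8");          // 'z' followed by U+60A8
  iter first(s8.begin(), s8.end());
  iter last;
  std::u32string s32(first, last);
  assert(s32.size() == 2);
  assert(s32[1] == 0x60A8);
}
```

A from32_iterator analogue would wrap such an iterator and re-encode each code point, which is the composition the converting_iterator template above expresses through inheritance.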

Problem 4: C++11 feature emulation for C++03 compilers

Discussion

The Boost library needs to support C++03 compilers where it is possible to do so with a reasonable amount of effort.

Proposed Boost component

Header <boost/interop/string_0x.hpp> provides the typedefs in the table below. Each typedef names the C++11 type if it is present and the equivalent C++03 type otherwise. Code written against these typedefs works with C++03 compilers through emulation and automatically switches to the C++11 features as they become available.

Typedef name      C++11 type      C++03 type
boost::u16_t      char16_t        boost::uint_least16_t
boost::u32_t      char32_t        boost::uint_least32_t
boost::u16string  std::u16string  std::basic_string<boost::u16_t>
boost::u32string  std::u32string  std::basic_string<boost::u32_t>

The implementation uses the typedefs provided by Microsoft Visual C++ 2010 if present.

The header also provides typedefs for UTF-8 encoded characters and strings. See Problem 5.
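The selection between C++11 and C++03 types might be sketched as follows. This is an illustration only: the namespace interop_sketch is hypothetical, the test on __cplusplus is an assumed detection mechanism (some compilers report an older value unless configured otherwise), and unsigned short and unsigned long stand in for the Boost integer typedefs so that the sketch has no Boost dependency.

```cpp
#include <cassert>
#include <climits>
#include <string>

namespace interop_sketch
{
#if __cplusplus >= 201103L
  // C++11 character and string types are available: use them directly.
  typedef char16_t       u16_t;
  typedef char32_t       u32_t;
  typedef std::u16string u16string;
  typedef std::u32string u32string;
#else
  // C++03 fallback: integer types wide enough to hold the code units.
  // These stand in for boost::uint_least16_t and boost::uint_least32_t.
  typedef unsigned short           u16_t;
  typedef unsigned long            u32_t;
  typedef std::basic_string<u16_t> u16string;
  typedef std::basic_string<u32_t> u32string;
#endif
}

// Usage: code written against the typedefs does not mention either standard.
inline void string_0x_example()
{
  assert(sizeof(interop_sketch::u16_t) * CHAR_BIT >= 16);
  assert(sizeof(interop_sketch::u32_t) * CHAR_BIT >= 32);
}
```

In the fallback branch, std::basic_string over an integer type relies on the library supplying or accepting a suitable character traits class, which is an assumption that does not hold for every standard library.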

Problem 5: No string type with UTF-8 encoding guarantee

This is a purely speculative feature. Its usefulness and practicality are unknown. It will be removed from the proposal if it proves to be problematical.

Discussion

There is currently no built-in character type or standard library string type that guarantees UTF-8 encoding.

Without such a string type, neither template arguments nor function overloads have a way to specify a narrow character with UTF-8 encoding. This is a confusing inconsistency with char16_t and char32_t. It sends the message to users that UTF-8 encoding is a second class citizen in the C++ world.

Although indirect detection of encoding via locales does work, it causes confusion and bugs, and is needlessly complex.

Proposed Boost component

Provide the following typedefs in <boost/interop/string_0x.hpp>.

Typedef name     C++11 type                      C++03 type
boost::u8_t      unsigned char                   unsigned char
boost::u8string  std::basic_string<boost::u8_t>  std::basic_string<boost::u8_t>

 


© Copyright Beman Dawes 2011

Revised 27 January 2012