This paper proposes additions to Boost to ease use of strings in an international environment. The primary focus is making the Unicode features of C++11 easier to use, although other character encodings are also supported.
These changes are the Boost equivalent of the C++ standard library TR2 proposal N3336, Adapting Standard Library Strings and I/O to a Unicode World.
A proof-of-concept Boost implementation is available at github.com/Beman/string-interoperability.
The proposals are presented as a series of problems and their solutions. The higher-level problems are presented before the lower-level problems. The higher-level components are built using some of the proposed lower-level components.
Standard library strings of different character types do not interoperate.
u32string s32;
s32 = "foo";                          // error!
u16string s16(s32);                   // error!
wstring ws(s32.begin(), s32.end());   // error!

void f(const string&);
f(s32);                               // error!
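Until such interoperability exists, each of these conversions has to be written by hand. As a minimal sketch of what that involves for the u16string case, the following converts UTF-32 to UTF-16. The names utf32_to_utf16 and utf32_to_utf16_example are hypothetical and not part of the proposal, and the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the proposal: converts UTF-32 to UTF-16.
// Assumes every input value is a valid Unicode scalar value (no error handling).
inline std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t c : in) {
        if (c < 0x10000) {
            out.push_back(static_cast<char16_t>(c));   // BMP: one code unit
        } else {
            c -= 0x10000;                              // 20 significant bits remain
            out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));     // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));   // low surrogate
        }
    }
    return out;
}

// Usage: the explicit call stands in for the conversion the proposal would make implicit.
inline void utf32_to_utf16_example()
{
    std::u32string s32 = U"foo";
    std::u16string s16(utf32_to_utf16(s32));
    assert(s16 == u"foo");
}
```

Every pair of character types needs a helper of this kind, which is the boilerplate the proposed overloads are meant to remove.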
The encoding of basic_string instantiations can be determined for the types under discussion. It is either implicit in the string's value_type or can be determined via the locale. See Boost.Filesystem V3 class path for an example of how such interoperability might be achieved.
Experience with Boost.Filesystem V3 class path has demonstrated that string interoperability brings a considerable simplification and improvement to internationalized applications, but that having to provide interoperability without the resolution of the issues presented here is a band-aid. It is being misused, too - users are passing around boost::filesystem::path objects simply to get string encoding interoperability!
The approach is to derive a string class from std::basic_string and add overloads to functions most likely to benefit from interoperability. The overloads are in the form of function templates with sufficient restrictions on overload resolution participation (i.e. enable_if) that the existing standard library functions are always selected if the value type of the argument is the same as or convertible to the std::basic_string type's value_type.
The semantics of the added signatures are the same as the original signatures except that arguments of the template parameter type have their value converted to the type and encoding of basic_string::value_type.
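The overload restriction can be sketched as follows. This is an illustration under stated assumptions, not the proof-of-concept code: the class interop_u32string and the to_utf32 helpers are hypothetical names, only std::string (treated as UTF-8) and std::u16string arguments are handled, and the decoders assume well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical decoders to UTF-32; both assume well-formed input.
inline std::u32string to_utf32(const std::string& s)   // char treated as UTF-8
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char lead = static_cast<unsigned char>(s[i++]);
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        char32_t c = extra == 0 ? lead : lead & (0x3F >> extra);
        while (extra-- > 0)                              // fold in continuation bytes
            c = (c << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        out.push_back(c);
    }
    return out;
}

inline std::u32string to_utf32(const std::u16string& s)
{
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t c = s[i];
        if (c >= 0xD800 && c < 0xDC00)                   // high surrogate: use next unit too
            c = 0x10000 + ((c - 0xD800) << 10) + (s[++i] - 0xDC00);
        out.push_back(c);
    }
    return out;
}

// Hypothetical derived string class with converting overloads.
class interop_u32string : public std::u32string
{
public:
    using std::u32string::u32string;   // inherit the base constructors
    using std::u32string::operator=;   // keep the std::basic_string overloads visible
    using std::u32string::append;

    // Participates in overload resolution only when T differs from char32_t,
    // so same-type arguments still select the std::basic_string members.
    template <class T,
              class = typename std::enable_if<!std::is_same<T, char32_t>::value>::type>
    interop_u32string& operator=(const std::basic_string<T>& s)
    {
        std::u32string::operator=(to_utf32(s));
        return *this;
    }

    template <class T,
              class = typename std::enable_if<!std::is_same<T, char32_t>::value>::type>
    interop_u32string& append(const std::basic_string<T>& s)
    {
        std::u32string::append(to_utf32(s));
        return *this;
    }
};

inline void interop_u32string_example()
{
    interop_u32string s;
    s = std::string("fo");             // UTF-8 argument, converted
    s.append(std::u16string(u"o"));    // UTF-16 argument, converted
    s.append(U"!");                    // same value_type: std::basic_string::append
    assert(s == U"foo!");
}
```

The sketch accepts only std::basic_string arguments; covering character pointers and iterator ranges, as the proposal intends, would need further overloads restricted in the same way.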
The std::basic_string functions given additional overloads are:
operator=, operator+=, append, and assign signature.
template <class T> unspecified_iterator c_str(), returning an unspecified iterator with value_type of T.
begin() and end(). Similar to c_str().
To keep the number and complexity of overloads manageable, the proof-of-concept implementation does not provide any way to specify error handling policies, or string and wstring encoding.
Not every added signature needs to be able to control error handling and encoding. The need is particularly rare in environments where UTF-8 is the narrow character encoding and UTF-16 is the wide character encoding. A subset, possibly just c_str(), begin(), and end(), with error handling and encoding parameters or arguments, suitably defaulted, may well be sufficient.
See <boost/interop/string.hpp>.
A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:
#include <iostream>

int main()
{
  std::cout << U"您好世界";   // error in C++11!
}
This code should "just work", even though the type of U"您好世界" is const char32_t*, not const char*, as long as the encoding of char supports 您好世界. Even if those characters are not supported by default encodings, alternatives like UTF-8 are available.
The code does "just work" with the proof-of-concept implementation of this proposal. On Linux, with default char encoding of UTF-8, execution produces the expected 您好世界 output. On Windows, the console doesn't support full UTF-8, so the output can be piped to a file or to a program which does handle UTF-8 correctly. And, yes, that does work correctly with the proof-of-concept implementation of this proposal.
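As a rough sketch of how such an inserter can work, the following adds a single overload for null-terminated UTF-32 onto std::ostream. It is not the proof-of-concept code: append_utf8 is a hypothetical helper, the stream's narrow encoding is assumed to be UTF-8, and the input is assumed to hold only valid Unicode scalar values.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: appends one code point to out as UTF-8.
// Assumes c is a valid Unicode scalar value.
inline void append_utf8(std::string& out, char32_t c)
{
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
}

// Sketch of an added character inserter: the argument is converted to the
// type and encoding of the stream, here assumed to be char and UTF-8.
inline std::ostream& operator<<(std::ostream& os, const char32_t* p)
{
    std::string utf8;
    for (; *p != 0; ++p)
        append_utf8(utf8, *p);
    return os << utf8;
}

inline void utf32_inserter_example()
{
    std::ostringstream os;
    os << U"\u60A8\u597D";   // the first two characters of the greeting above
    assert(os.str() == "\xE6\x82\xA8\xE5\xA5\xBD");
}
```

A full solution would be a function template over the stream's charT and the argument's character type, restricted so that the existing inserters remain the preferred match for the types they already cover.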
Add additional function templates to those in 27.7.3.6.4 [ostream.inserters.character], Character inserter function templates, to cover the case where the argument character type differs from charT and is not char, signed char, unsigned char, const char*, const signed char*, or const unsigned char*. (The specified types are excluded because they are covered by existing signatures.)
The semantics of the added signatures are the same as original signatures except
that arguments shall be converted to the type and encoding of the stream.
Do the same for the character extractors in 27.7.2.2.3 [istream::extractors], basic_istream::operator>>.
Do the same for the two std::basic_string inserters and extractors in 21.4.8.9 [string.io], Inserters and extractors.
See <boost/interop/stream.hpp>.
Conversion between character types and their encodings using current standard library facilities such as std::codecvt, std::locale, and std::wstring_convert has multiple problems:
codecvt facets don't easily compose into a complete conversion from one encoding to another. Such composition is existing practice in C libraries like ICU. UTF-32 is the obvious choice for the common encoding to pass between codecs.
std::locale and code conversion, even when these are implementation details that should be hidden from the application.
The generalization of the std::basic_string function c_str is:
template <class T> unspecified_iterator c_str() const;
Given a std::string named s8, this allows a user to write s8.c_str<char16_t>() to obtain an iterator with a value type of char16_t. To implement this function generically using the current standard library would be difficult, and would involve the creation of a temporary string. The full implementation with the proposed solution is simply:
template <class T>
converting_iterator<const_iterator, value_type, by_range, T> c_str() const
{
  return converting_iterator<const_iterator, value_type, by_range, T>(cbegin(), cend());
}
No temporary string is created, and none of the other problems listed above are present either. The solution is generally useful for user defined types, and not just for implementations of the standard library.
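To illustrate why no temporary string is needed, here is a minimal sketch of an iterator that decodes UTF-8 to UTF-32 on demand. The class name utf8_to_utf32_iterator is hypothetical and the interface is far simpler than the proposed converting_iterator; it assumes well-formed UTF-8 and performs no error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical input iterator yielding UTF-32 code points decoded lazily from
// a UTF-8 range. Assumes well-formed input (no error handling).
class utf8_to_utf32_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t                value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const char32_t*         pointer;
    typedef char32_t                reference;   // decoded values are returned by value

    utf8_to_utf32_iterator(std::string::const_iterator pos,
                           std::string::const_iterator end)
        : pos_(pos), end_(end) {}

    // Decodes the code point at the current position; nothing is stored.
    char32_t operator*() const
    {
        std::string::const_iterator p = pos_;
        unsigned char lead = static_cast<unsigned char>(*p++);
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        char32_t c = extra == 0 ? lead : lead & (0x3F >> extra);
        for (; extra > 0; --extra)
            c = (c << 6) | (static_cast<unsigned char>(*p++) & 0x3F);
        return c;
    }

    utf8_to_utf32_iterator& operator++()
    {
        ++pos_;                                           // skip the lead byte
        while (pos_ != end_ && (static_cast<unsigned char>(*pos_) & 0xC0) == 0x80)
            ++pos_;                                       // skip continuation bytes
        return *this;
    }

    utf8_to_utf32_iterator operator++(int)
    {
        utf8_to_utf32_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    friend bool operator==(const utf8_to_utf32_iterator& a, const utf8_to_utf32_iterator& b)
    { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf8_to_utf32_iterator& a, const utf8_to_utf32_iterator& b)
    { return a.pos_ != b.pos_; }

private:
    std::string::const_iterator pos_;
    std::string::const_iterator end_;
};

inline void utf8_to_utf32_iterator_example()
{
    const std::string s8 = "a\xE6\x82\xA8";               // 'a' followed by U+60A8
    utf8_to_utf32_iterator first(s8.begin(), s8.end());
    utf8_to_utf32_iterator last(s8.end(), s8.end());
    assert(std::distance(first, last) == 2);              // two code points, four bytes
}
```

Because each code point is decoded only when the iterator is dereferenced, an algorithm can consume the converted sequence directly from the original string.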
Other problems become easier to solve with converting_iterator. For example, the Filesystem library's class path in N3239 has many functions with an argument in the form const codecvt_type& cvt=codecvt() that could be eliminated by either direct or indirect use of converting_iterator.
Boost Regex for many years has included a set of Unicode conversion iterators as an implementation detail. Although these do not provide composition, they do demonstrate the technique of using encoding conversion iterators to avoid creation of temporary strings.
This solution is based on the proof-of-concept implementation. Input iterator requirements can probably be loosened to bidirectional, but that hasn't been tested yet.
The preliminaries begin with end-detection policy classes, since strings use null termination, size, or half-open ranges to determine the end of a sequence.
template <class InputIterator> class by_null;
template <class InputIterator> class by_size;
template <class InputIterator> class by_range;
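The paper gives only the declarations of these policies, so the following is a guess at one possible shape, offered as a sketch. The member functions is_end and advance, and the function count_units, are assumptions made for illustration; the proof-of-concept interface may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>

// Hypothetical policy interface: is_end(it) reports whether it is at the end
// of the sequence, advance() is called once per increment of the iterator.

template <class InputIterator>
class by_null                      // sequence ends at the first zero element
{
public:
    bool is_end(InputIterator it) const
    { return *it == typename std::iterator_traits<InputIterator>::value_type(); }
    void advance() {}
};

template <class InputIterator>
class by_size                      // sequence ends after a fixed number of elements
{
public:
    explicit by_size(std::size_t n) : remaining_(n) {}
    bool is_end(InputIterator) const { return remaining_ == 0; }
    void advance() { --remaining_; }
private:
    std::size_t remaining_;
};

template <class InputIterator>
class by_range                     // sequence ends at the end of a half-open range
{
public:
    explicit by_range(InputIterator end) : end_(end) {}
    bool is_end(InputIterator it) const { return it == end_; }
    void advance() {}
private:
    InputIterator end_;
};

// A loop written once against the policy interface works for all three.
template <class InputIterator, class EndPolicy>
std::size_t count_units(InputIterator it, EndPolicy end)
{
    std::size_t n = 0;
    for (; !end.is_end(it); ++it, end.advance())
        ++n;
    return n;
}

inline void end_policy_example()
{
    const char* p = "abc";
    assert(count_units(p, by_null<const char*>()) == 3);
    assert(count_units(p, by_size<const char*>(2)) == 2);
    assert(count_units(p, by_range<const char*>(p + 3)) == 3);
}
```

Factoring end detection into a policy lets a single codec implementation serve C strings, counted buffers, and iterator ranges.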
Codec templates handle actual conversion to and from UTF-32. The primary templates are:
template <class InputIterator, class FromCharT, template<class> class EndPolicy>
class to32_iterator;

template <class InputIterator, class ToCharT>
class from32_iterator;
The Boost library would provide specializations for char, wchar_t, char16_t, and char32_t. Presumably users could provide specializations for UDTs, but that hasn't been tested yet. The char and wchar_t specializations provide mechanisms to select the encoding. Since this is a new component the char default encoding could be UTF-8 rather than locale based and no existing code would be broken.
The actual converting_iterator primary template is simply:
template <class InputIterator, class FromCharT, template<class> class EndPolicy, class ToCharT>
class converting_iterator
  : public from32_iterator<to32_iterator<InputIterator, FromCharT, EndPolicy>, ToCharT>
{
public:
  using from32_iterator::from32_iterator;
};
Specializations may be provided, but aren't required. The proof-of-concept implementation doesn't use inherited constructors because of lack of compiler support.
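The composition through UTF-32 can be sketched with two small stages. All class names here (utf16_to32_iterator, utf32_to8_iterator, utf16_to8_iterator) are hypothetical stand-ins rather than the proposed templates; end detection is reduced to comparing against an end iterator, the source iterator is assumed to be multi-pass, and input is assumed to be well-formed.

```cpp
#include <cassert>
#include <string>

// Stage 1 (hypothetical): reads UTF-16 code units, yields UTF-32 code points.
template <class InputIterator>
class utf16_to32_iterator
{
    InputIterator pos_;
public:
    explicit utf16_to32_iterator(InputIterator pos) : pos_(pos) {}
    char32_t operator*() const
    {
        InputIterator p = pos_;
        char32_t c = *p;
        if (c >= 0xD800 && c < 0xDC00)     // high surrogate: combine with the low one
            c = 0x10000 + ((c - 0xD800) << 10) + (static_cast<char32_t>(*++p) - 0xDC00);
        return c;
    }
    utf16_to32_iterator& operator++()
    {
        const char16_t u = *pos_;
        ++pos_;
        if (u >= 0xD800 && u < 0xDC00) ++pos_;   // skip the low surrogate as well
        return *this;
    }
    friend bool operator!=(const utf16_to32_iterator& a, const utf16_to32_iterator& b)
    { return a.pos_ != b.pos_; }
};

// Stage 2 (hypothetical): reads UTF-32 code points, yields UTF-8 code units.
template <class U32Iterator>
class utf32_to8_iterator
{
    U32Iterator  pos_;
    mutable char buf_[4];   // UTF-8 encoding of the current code point
    mutable int  len_;      // 0 until buf_ has been filled for this code point
    int          idx_;      // index of the current code unit within buf_

    void fill() const
    {
        const char32_t c = *pos_;
        if (c < 0x80)         { len_ = 1; buf_[0] = static_cast<char>(c); }
        else if (c < 0x800)   { len_ = 2; buf_[0] = static_cast<char>(0xC0 | (c >> 6)); }
        else if (c < 0x10000) { len_ = 3; buf_[0] = static_cast<char>(0xE0 | (c >> 12)); }
        else                  { len_ = 4; buf_[0] = static_cast<char>(0xF0 | (c >> 18)); }
        for (int i = 1; i < len_; ++i)           // trailing bytes, 6 bits each
            buf_[i] = static_cast<char>(0x80 | ((c >> (6 * (len_ - 1 - i))) & 0x3F));
    }
public:
    explicit utf32_to8_iterator(U32Iterator pos) : pos_(pos), buf_(), len_(0), idx_(0) {}
    char operator*() const { if (len_ == 0) fill(); return buf_[idx_]; }
    utf32_to8_iterator& operator++()
    {
        if (len_ == 0) fill();
        if (++idx_ == len_) { ++pos_; idx_ = 0; len_ = 0; }   // move to next code point
        return *this;
    }
    friend bool operator!=(const utf32_to8_iterator& a, const utf32_to8_iterator& b)
    { return a.pos_ != b.pos_ || a.idx_ != b.idx_; }
};

// Composition: the outer stage simply wraps the inner one.
template <class InputIterator>
class utf16_to8_iterator : public utf32_to8_iterator<utf16_to32_iterator<InputIterator> >
{
    typedef utf16_to32_iterator<InputIterator> stage1;
public:
    explicit utf16_to8_iterator(InputIterator pos)
        : utf32_to8_iterator<stage1>(stage1(pos)) {}
};

inline std::string utf16_to_utf8(const std::u16string& s)
{
    typedef utf16_to8_iterator<std::u16string::const_iterator> iter;
    std::string out;
    for (iter it(s.begin()), end(s.end()); it != end; ++it)
        out += *it;                       // no intermediate UTF-32 string is built
    return out;
}

inline void composition_example()
{
    assert(utf16_to_utf8(u"\u60A8") == "\xE6\x82\xA8");
}
```

With UTF-32 as the common intermediate form, N encodings need N decoding and N encoding stages rather than a dedicated converter for every pair.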
The Boost library needs to support C++03 compilers where it is possible to do so with a reasonable amount of effort.
Header <boost/interop/string_0x.hpp> provides the typedefs in the table below. It provides typedefs for the C++11 types if they are present, and otherwise for equivalent C++03 types. By using these typedefs, emulation of these C++11 features is available with C++03 compilers, and code automatically switches to the C++11 features as they become available.
Typedef name | C++11 type | C++03 type
boost::u16_t | char16_t | boost::uint_least16_t
boost::u32_t | char32_t | boost::uint_least32_t
boost::u16string | std::u16string | std::basic_string<boost::u16_t>
boost::u32string | std::u32string | std::basic_string<boost::u32_t>
The implementation uses the typedefs provided by Microsoft Visual C++ 2010 if present.
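The selection mechanism might look roughly like the sketch below. It is an assumption-laden illustration: the namespace interop_sketch stands in for boost, the test on __cplusplus stands in for whatever configuration macros the real header uses, and <stdint.h> stands in for the Boost integer typedefs.

```cpp
#include <cassert>
#include <climits>
#include <string>
#include <stdint.h>   // uint_least16_t, uint_least32_t for the fallback branch

namespace interop_sketch {   // hypothetical namespace standing in for boost

#if __cplusplus >= 201103L   // assumption: a real header would test per-feature macros
typedef char16_t       u16_t;
typedef char32_t       u32_t;
#else
typedef uint_least16_t u16_t;
typedef uint_least32_t u32_t;
#endif

// The string typedefs are written once in terms of the character typedefs.
typedef std::basic_string<u16_t> u16string;
typedef std::basic_string<u32_t> u32string;

}   // namespace interop_sketch

inline void typedef_example()
{
    // Either branch must provide at least 16 and 32 bits respectively.
    assert(sizeof(interop_sketch::u16_t) * CHAR_BIT >= 16);
    assert(sizeof(interop_sketch::u32_t) * CHAR_BIT >= 32);
}
```

User code written against the typedef names compiles unchanged under either branch, which is the point of the header.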
The header also provides typedefs for UTF-8 encoded characters and strings. See Problem 5.
This is a purely speculative feature. Its usefulness and practicality are unknown. It will be removed from the proposal if it proves to be problematical.
There is currently no built-in character type or standard library string type that guarantees UTF-8 encoding.
Without such a string type, neither template arguments nor function overloads have a way to specify a narrow character with UTF-8 encoding. This is a confusing inconsistency with char16_t and char32_t. It sends the message to users that UTF-8 encoding is a second-class citizen in the C++ world.
Although indirect detection of encoding via locales does work, it causes confusion and bugs, and is needlessly complex.
Provide the following typedefs in <boost/interop/string_0x.hpp>.
Typedef name | C++11 type | C++03 type
boost::u8_t | unsigned char | unsigned char
boost::u8string | std::basic_string<boost::u8_t> | std::basic_string<boost::u8_t>
© Copyright Beman Dawes 2011
Revised 27 January 2012