Provides easy-to-use, automatic solutions to irritating string interoperability problems, in the style of similar standard library functionality (int/float/double).
STL pointed out that you can stream a narrow string to std::wcout, but not a wide string to std::cout.
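A minimal sketch of the asymmetry STL described, using a std::wostringstream in place of std::wcout so the result can be inspected. The helper name narrow_into_wide is illustrative only. The reverse direction has no widening counterpart: before C++20, std::cout << L"wide" selects the const void* overload and prints an address, and since C++20 that overload is deleted.

```cpp
#include <sstream>
#include <string>

// Streams a narrow string into a wide stream. The wide stream widens each
// char through its locale before storing it.
std::wstring narrow_into_wide(const char* text) {
    std::wostringstream out;
    out << text;  // basic_ostream<wchar_t> has an overload taking const char*
    return out.str();
}
```

For plain ASCII input in the default "C" locale, narrow_into_wide("hello") yields the wide string L"hello".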
We can write these using locales and codecvt facets. Beman ran into problems with them, because they work on strings, not on iterators.
Beman came up with a generic programming design, and he's looking for feedback on the design. He's very happy with the improvement in the interfaces that he's come up with.
Eric called this "the elephant in the room": locales deal with neither Unicode nor iterators, etc. He feels that this should be part of a larger i18n/localization framework.
Howard said that if he had a clean slate, he'd keep code conversions and localization separate.
Bill pointed out that the fundamental problem is that the encoding is not part of the type, and that's one of the things that makes it hard.
Bill also said that "we're in the middle of a decade-long conversion", and there's still a lot of software that uses (say) Big5, and we have to accommodate them.
Bill also said "Given this mess, we should probably pick a couple winners and back them" (regarding character encodings)
Beman notes that in his design, the user (or the vendor) could supply codecs to support other encodings, but he thinks we should supply (native, utf8/16/32)
Howard asked Bill "is this a replacement for wstrconvert?" Bill and Beman both said "No", this is a much more generic approach.
Bill points out that we're trying to guess what people are going to want in the future, and we should provide a minimum set.
Beman said that internally, these conversion routines are just a typedef to a converting iterator, and a call to std::copy.
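A rough sketch of the "typedef plus std::copy" shape, not Beman's actual code. To keep it short it converts ISO 8859-1 (Latin-1) to UTF-32, because Latin-1 maps one-to-one onto U+0000..U+00FF and needs no state. The names latin1_to_utf32_iterator and latin1_to_utf32 are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical converting iterator: reads Latin-1 bytes from an underlying
// iterator and yields UTF-32 code points.
template <class InputIt>
class latin1_to_utf32_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit latin1_to_utf32_iterator(InputIt it) : it_(it) {}

    // Go through unsigned char so bytes >= 0x80 are not sign-extended.
    char32_t operator*() const {
        return static_cast<char32_t>(static_cast<unsigned char>(*it_));
    }
    latin1_to_utf32_iterator& operator++() { ++it_; return *this; }
    latin1_to_utf32_iterator operator++(int) {
        latin1_to_utf32_iterator old = *this;
        ++it_;
        return old;
    }
    friend bool operator==(const latin1_to_utf32_iterator& a,
                           const latin1_to_utf32_iterator& b) {
        return a.it_ == b.it_;
    }
    friend bool operator!=(const latin1_to_utf32_iterator& a,
                           const latin1_to_utf32_iterator& b) {
        return !(a == b);
    }

private:
    InputIt it_;
};

// The whole-string conversion is then a typedef and a call to std::copy.
inline std::u32string latin1_to_utf32(const std::string& in) {
    typedef latin1_to_utf32_iterator<std::string::const_iterator> iter;
    std::u32string out;
    std::copy(iter(in.begin()), iter(in.end()), std::back_inserter(out));
    return out;
}
```

Because the work happens inside the iterator, the same adaptor can feed any algorithm that accepts input iterators, without building an intermediate string.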
(more from Beman)
Beman said (referencing Bill) that there's a real need to match up types and encodings.
The suggested (not for inclusion in the standard) solution for this is to call basic_string as utf-8.
Marshall suggested uint8_t instead, but STL and Bill pointed out that this might be the same as 'char' depending on the platform.
Alisdair pointed out that, technically, you can't specialize std::char_traits, since unsigned char is not a user-defined type.
Beman explained there is a (compile-time) mapping between types and codecs.
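One way such a compile-time mapping could look is a traits template specialized per character type. This is only an illustration: the tag names and the default_codec trait are assumptions, not names from the proposal.

```cpp
#include <type_traits>

// Hypothetical codec tags; the names are illustrative only.
struct narrow_codec {};  // the system's native narrow encoding
struct utf16_codec {};
struct utf32_codec {};

// Primary template is left undefined, so an unsupported character type
// fails at compile time instead of silently picking a codec.
template <class CharT>
struct default_codec;

template <> struct default_codec<char>     { typedef narrow_codec type; };
template <> struct default_codec<char16_t> { typedef utf16_codec type; };
template <> struct default_codec<char32_t> { typedef utf32_codec type; };

static_assert(std::is_same<default_codec<char32_t>::type, utf32_codec>::value,
              "char32_t selects the UTF-32 codec");
```

UTF-8 has no character type of its own in this sketch, so a mapping of this kind cannot select it from the type alone; it would have to be requested explicitly.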
The key implementation point here is a conversion iterator. It is templated on two codecs and an iterator whose value_type is one of the supported character types.
Jeffrey relayed input from the ICU people at Google, who say that backwards iteration is useful in some use cases.
The "lingua franca" of all the codecs is UTF-32. Each codec converts to/from a single encoding to UTF-32.
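A sketch of the UTF-32 pivot idea using whole-string functions rather than the iterator design discussed here. The function names are hypothetical, and the decoder assumes well-formed UTF-8 (no checks for truncated, overlong, or otherwise malformed sequences), which a real codec would need.

```cpp
#include <cstddef>
#include <string>

// UTF-8 -> UTF-32. Assumes well-formed input; performs no validation.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int trailing;  // number of continuation bytes that follow the lead
        if (lead < 0x80)      { cp = lead;        trailing = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
        else                  { cp = lead & 0x07; trailing = 3; }
        // Each continuation byte contributes its low six bits.
        for (; trailing > 0 && i < in.size(); --trailing)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// UTF-32 -> UTF-16 in native endianness. Assumes valid code points.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Any pair of encodings composes through the UTF-32 pivot.
std::u16string utf8_to_utf16(const std::string& in) {
    return utf32_to_utf16(utf8_to_utf32(in));
}
```

With a pivot, N encodings need N codecs rather than a converter for every pair of encodings, at the cost of an intermediate step on every conversion.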
Jeffrey pointed out that you can convert more than one character at once with a table-based method, and get significantly better performance.
Beman responded that implementations can specialize individual conversions.
Eric opined that regardless, these conversions are useful.
Alisdair is concerned about the name "from_iterator"; Beman said he's not wedded to any of those names. We decided not to go down that road at this time.
Alisdair asked whether or not the single-param version of conversion_iterator should be explicit.
Marshall asked if there were two utf-16 codecs, one for big-endian and one for little-endian. Beman said that this was converting to utf-16 characters in the native endianness. Jeffrey pointed out that you could write a conversion iterator to change to different endianness if you wanted, and Beman agreed.
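The endianness change Jeffrey mentioned amounts to swapping the two bytes of each UTF-16 code unit. A minimal sketch, written as a whole-string function rather than a conversion iterator, with hypothetical names:

```cpp
#include <algorithm>
#include <string>

// Swaps the two bytes of one UTF-16 code unit.
inline char16_t swap_bytes(char16_t unit) {
    return static_cast<char16_t>(((unit & 0x00FF) << 8) | ((unit >> 8) & 0x00FF));
}

// Converts native-endian UTF-16 to the opposite byte order.
// Applying it twice restores the original text.
inline std::u16string swap_utf16_byte_order(std::u16string text) {
    std::transform(text.begin(), text.end(), text.begin(), swap_bytes);
    return text;
}
```

The swap works per code unit, so surrogate pairs need no special handling: each half is swapped independently and the order of the units is unchanged.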
Looking at the codec definition, there are two nested template classes. Alisdair asked if Beman had considered using template aliases here. Beman said no, because he's not that experienced with C++11 yet; he might in the future.
Alisdair, looking at the codec definition, asked about requirements on the from_iterator and to_iterator. Beman pointed at the text, which describes the requirements.
Jeffrey suggested that we need a paper naming the iterator concepts that InputIterators require, since Beman uses that here.
Eric asked why you can construct a from_iterator from two different iterator types. Beman said that it was to disambiguate from the (iter, size_t) constructor. STL said that this is not the right fix, and gave std::vector as an example of a better solution.
Beman asked Howard "why don't we have an is_iterator type trait?" STL says that it's hard; "not an integral type" (at least).
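For reference, std::vector's iterator-pair constructor is specified not to participate in overload resolution unless the argument type qualifies as an input iterator, which is what keeps vector(5, 10) from matching it. The sketch below imitates that with a hand-written is_iterator. The trait and the int_buffer type are hypothetical, and the trait is only an approximation: it relies on std::iterator_traits being SFINAE-friendly, as specified since C++17 and implemented earlier by the major standard libraries.

```cpp
#include <cstddef>
#include <iterator>
#include <type_traits>
#include <vector>

// C++11-compatible stand-in for std::void_t.
template <class...>
struct make_void { typedef void type; };

// Hypothetical trait: true when std::iterator_traits<T> names an
// iterator_category. Integral types such as int do not.
template <class T, class = void>
struct is_iterator : std::false_type {};

template <class T>
struct is_iterator<
    T, typename make_void<typename std::iterator_traits<T>::iterator_category>::type>
    : std::true_type {};

// Hypothetical container with both constructor shapes.
struct int_buffer {
    std::vector<int> data;

    // (count, value)
    int_buffer(std::size_t count, int value) : data(count, value) {}

    // (first, last): removed from overload resolution for non-iterators, so
    // int_buffer(3, 7) selects the (count, value) constructor above.
    template <class It,
              class = typename std::enable_if<is_iterator<It>::value>::type>
    int_buffer(It first, It last) : data(first, last) {}
};
```

Without the constraint, int_buffer(3, 7) would deduce It as int, match the template exactly, and fail inside the constructor body.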
Jeffrey relayed info from ICU people that there are some encodings (not common) where they put the bytes in a different order than Unicode.
Beman said that these codecs are similar (in principle) to uchar.h, where a codec can consume more than one character before producing output. Marshall said that as long as the storage needed for each conversion is bounded, then there shouldn't be a problem. Jeffrey agreed.
Eric pointed out that combining characters (u + combining umlaut, say) cannot necessarily be handled; what happens if you get a 'u' as the last character in the output? Beman and Bill say that this behavior is already defined via Unicode, and Beman referred to the error handling sections of his proposal. Beman pointed out that the "flush" operation is called "dereference", and that made Eric happy.
Beman opined that we really made a big mistake in C++11 in introducing UTF-8 literals w/o having a UTF-8 character type.
Jeffrey said "You've been very careful to make the default the system encoding, and made it really easy to get UTF-8. Can we switch that? Make the default UTF-8, and make it easy to get native encoding?" Discussion ensued about the relative popularity, and trend lines, w/o any resolution.
Alisdair asked about deducible template arguments in copy_string, and Beman said that there was an error in that code - template params are out of order.
Marshall says (a) he likes it, (b) he wants to see results from more systems, and (c) he offered to help with that.
Jeffrey says (relaying from his ICU folks) that they (a) like the structures, (b) find it good for string conversion, (c) think that the general mechanism is overkill, and (d) don't care much for support for system encoding; there should be a solution that discourages conversions (i.e., live in Unicode).
Beman pointed out that there is a null conversion.
STL asks if you can leave the conversion method unspecified and decide on either code-point at a time or whole-string.
STL says that the barrier to Unicode is that it's hard to convert; and anything that makes it easy is better.
Jeffrey's people would like to see just string conversions, rather than a general-purpose conversion facility.
Alisdair says: Ideally, I would like to just deal with Unicode, and never deal with anything else. But I can't just deal with Unicode, so I want this.
Eric says: I want this too, but I'm concerned about some of the ICU concerns, and he wonders if removing the extensibility from the proposal would simplify it (and remove confusion for users). Beman asked what that would mean - would each implementer have its own interface for the codecs?
Steven says that if it's string at a time, there's a concept of a locale or code page, and there's usually an API to convert between strings / code-page.
Beman says that people don't like to generate a lot of temporaries, and on some systems, the number of temporaries can determine the performance of the system.
Alisdair really wants to specify the user-extensible mechanism here. This will give users a disincentive to "roll their own". Eric wants the interface too, but he's just not 100% sure that this covers all of the cases. Jeffrey says that he thinks that it does, given that it is ICU's fallback mechanism.
Alisdair: Let's poll.
Does the committee have an interest in delivering this interface (including codecs) that we saw today?
SF  WF  WA  SA
2   4   0   2
SF = Strongly in favor
WF = Weakly in favor
WA = Weakly against
SA = Strongly against