utf8towcr(3) - Manual pages

NAME

mbintowcr, mbintowcr_l, utf8towcr, wcrtombin, wcrtombin_l, wcrtoutf8 - 8-bit-clean wchar conversion with escaping or validation

SYNOPSIS

#include <wchar.h>

size_t
mbintowcr(wchar_t * restrict dst, const char * restrict src, size_t dlen, size_t *slen, int flags);

size_t
utf8towcr(wchar_t * restrict dst, const char * restrict src, size_t dlen, size_t *slen, int flags);

size_t
wcrtombin(char * restrict dst, const wchar_t * restrict src, size_t dlen, size_t *slen, int flags);

size_t
wcrtoutf8(char * restrict dst, const wchar_t * restrict src, size_t dlen, size_t *slen, int flags);

#include <xlocale.h>

size_t
mbintowcr_l(wchar_t * restrict dst, const char * restrict src, size_t dlen, size_t *slen, locale_t locale, int flags);

size_t
wcrtombin_l(char * restrict dst, const wchar_t * restrict src, size_t dlen, size_t *slen, locale_t locale, int flags);

DESCRIPTION

The mbintowcr() and wcrtombin() functions translate byte data into wide-char format and back again. Under normal conditions (but not with all flags) these functions guarantee that the round-trip will be 8-bit-clean. Care must be taken to specify the WCSBIN_EOF flag so that trailing incomplete sequences at stream EOF are handled properly.

For the "C" locale these functions are 1:1 (do not convert UTF-8). For UTF-8 locales these functions convert to/from UTF-8. Most of the discussion below pertains to UTF-8 translations.

The utf8towcr() and wcrtoutf8() functions do exactly the same thing as the above functions but are locked to the UTF-8 locale. That is, these functions work regardless of which locale has been selected and do not require an initial setlocale() call. Applications working explicitly in UTF-8 should use these versions.

Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF). Illegal sequences include surrogate-space encodings, non-canonical encodings, codings beyond 0x10FFFF, 5-byte and 6-byte codings (which are no longer legal), and malformed codings. Flags may be used to modify this behavior.

The mbintowcr() function takes generic 8-bit byte data as its input which the caller expects to be loosely coded in UTF-8 and converts it to an array of wchar_t, and returns the number of wchar_t that were converted. The caller must set *slen to the number of bytes in the input buffer and the function will set *slen on return to the number of bytes in the input buffer that were processed.

Fewer bytes than specified might be processed due to the output buffer reaching its limit or due to an incomplete sequence at the end of the input buffer when the WCSBIN_EOF flag has not been specified.

If processing a stream, the caller typically copies any unprocessed data at the end of the buffer back to the beginning and then continues loading the buffer from there. Be sure to check for an incomplete translation at stream EOF and do a final translation of the remainder with the WCSBIN_EOF flag set.
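The copy-down step described above can be sketched as a small helper. This is only an illustration: carry_over() is a hypothetical name, not part of libc, and it assumes the caller has just called mbintowcr() (or utf8towcr()) and that `consumed` is the value the function stored in *slen on return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libc): after a conversion call has
 * consumed `consumed` of the `filled` bytes in buf, move the unprocessed
 * tail to the front of the buffer.  Returns the number of bytes carried
 * over, which is the offset at which the caller resumes loading.
 */
static size_t
carry_over(char *buf, size_t filled, size_t consumed)
{
    size_t rest;

    assert(consumed <= filled);
    rest = filled - consumed;
    /* The regions may overlap, so memmove() is required, not memcpy(). */
    memmove(buf, buf + consumed, rest);
    return rest;
}
```

A caller would read new data into buf starting at the returned offset, convert again, and repeat. At stream EOF, if bytes are still left over, one final conversion of the remainder with WCSBIN_EOF set handles the trailing incomplete sequence.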

This function will always generate escapes for illegal UTF-8 code sequences and by default can produce a clean BYTE-WCHAR-BYTE conversion. See the FLAGS section below.

This function cannot return an error unless the WCSBIN_STRICT flag is set. In case of error, any valid conversions are returned first and the caller is expected to iterate. The error is returned when it becomes the first element of the buffer.

A NULL destination buffer may be specified, in which case this function operates identically except that it does not store any output. This feature is typically used for validation with WCSBIN_STRICT and is sometimes combined with WCSBIN_SURRO (set it if you want to allow surrogates).

The wcrtombin() function takes an array of wchar_t as its input which is usually expected to be well-formed and converts it to an array of generic 8-bit byte data. The caller must set *slen to the number of elements in the input buffer and the function will set *slen on return to the number of elements in the input buffer that were processed.

Be sure to properly set the WCSBIN_EOF flag for the last buffer at stream EOF.

This function can return an error regardless of the flags if a supplied wchar code is out of range. Some flags change the range of allowed wchar codes. In case of error, any valid conversions are returned first and the caller is expected to iterate. The error is returned when it becomes the first element of the buffer.

A NULL destination buffer may be specified, in which case this function operates identically except that it does not store any output. This feature is typically used for validation with or without WCSBIN_STRICT and is sometimes combined with WCSBIN_SURRO.

One final note on the use of WCSBIN_SURRO in the wchars-to-bytes direction: if this flag is not set, surrogates in the escape range will be de-escaped (giving us our 8-bit-clean round-trip), and other surrogates will be passed through as UTF-8 encodings. In WCSBIN_STRICT mode this flag works slightly differently. If it is not specified, no surrogates are allowed at all (escaped or otherwise); if it is specified, all surrogates are allowed and will never be de-escaped.

The _l-suffixed versions of mbintowcr() and wcrtombin() take an explicit locale argument, whereas the non-suffixed versions use the current global or per-thread locale.

UTF-8B ESCAPE SEQUENCES

Escaping is handled by converting one or more bytes in the byte sequence to the UTF-8B escape wchar (U+DC80 - U+DCFF). Most illegal sequences escape the first byte and then reprocess the remaining bytes. An illegal byte sequence length (5 or 6 bytes), non-canonical encoding, or illegal wchar value (beyond 0x10FFFF if not modified by flags) will escape all bytes in the sequence as long as they were not malformed.

When converting back to a byte-sequence, if not modified by flags, UTF-8B escape wchars are converted back to their original bytes. Other surrogate codes (U+D800 - U+DFFF which are normally illegal) will be passed through and encoded as UTF-8.
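The relationship between an offending byte and its escape wchar can be sketched as follows. This assumes the conventional UTF-8B mapping, in which a byte b in the range 0x80 - 0xFF is represented as 0xDC00 + b; that is consistent with the U+DC80 - U+DCFF range given above, but the mapping and all helper names here are illustrative assumptions rather than part of the libc interface.

```c
#include <assert.h>
#include <stdint.h>

#define UTF8B_BASE 0xDC00u   /* assumed: escape wchar = UTF8B_BASE + byte */

/* True for any code in the surrogate space U+D800 - U+DFFF. */
static int
is_surrogate(uint32_t wc)
{
    return wc >= 0xD800u && wc <= 0xDFFFu;
}

/* True only for the UTF-8B escape subrange U+DC80 - U+DCFF. */
static int
is_utf8b_escape(uint32_t wc)
{
    return wc >= 0xDC80u && wc <= 0xDCFFu;
}

/* Bytes-to-wchars: represent one non-ASCII byte as an escape wchar. */
static uint32_t
utf8b_escape(unsigned char byte)
{
    assert(byte >= 0x80);    /* ASCII bytes never need escaping */
    return UTF8B_BASE + byte;
}

/* Wchars-to-bytes: recover the original byte from an escape wchar. */
static unsigned char
utf8b_unescape(uint32_t wc)
{
    assert(is_utf8b_escape(wc));
    return (unsigned char)(wc - UTF8B_BASE);
}
```

Under this assumed mapping, the non-canonical two-byte sequence C0 80 would be escaped as the two wchars U+DCC0 U+DC80 and would convert back to the same two bytes. A surrogate outside the escape subrange, such as U+D800, is not de-escaped.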

FLAGS

WCSBIN_EOF

Indicate that the input buffer represents the last of the input stream. This causes any partial sequences at the end of the input buffer to be processed.

WCSBIN_SURRO

This flag passes through any surrogate codes that are already UTF-8-encoded. This is normally illegal, but if you are processing a stream which has already been UTF-8B escaped, this flag will prevent the U+DC80 - U+DCFF codes from being re-escaped in the bytes-to-wchars direction and will prevent decoding back to the original bytes in the wchars-to-bytes direction. This flag is sometimes used on input if the caller expects the input stream to already be escaped, and is not usually used on output unless the caller explicitly wants to encode to an intermediate illegal UTF-8 encoding that retains the escapes as escapes.

This flag does not prevent additional escapes from being translated on bytes-to-wchars (WCSBIN_STRICT prevents escaping on bytes-to-wchars), but will prevent de-escaping on wchars-to-bytes.

This flag breaks round-trip 8-bit-clean operation since escape codes use the surrogate space and will mix with surrogates that are passed through on input by this flag in a way that cannot be distinguished.

WCSBIN_LONGCODES

Specifying this flag in the bytes-to-wchars direction allows for decoding of legacy 5-byte and 6-byte sequences as well as 4-byte sequences which would normally be illegal. These sequences are illegal and this flag should not normally be used unless the caller explicitly wants to handle the legacy case.

Specifying this flag in the wchars-to-bytes direction allows normally illegal wchars to be encoded. Again, not recommended.

This flag does not allow decoding non-canonical sequences. Such sequences will still be escaped.

WCSBIN_STRICT

This flag forces strict parsing in the bytes-to-wchars direction and will cause mbintowcr() to return short, or to return an error once processing reaches the illegal coding, rather than escaping the illegal sequence. This flag is usually specified only when the caller desires to validate a UTF-8 buffer. Remember that an error may also be present with return values greater than 0. A partial sequence at the end of the buffer is not considered to be an error unless WCSBIN_EOF is also specified.

The caller is reminded that when using this feature for validation, a short return can happen rather than an error if the error is not at the base of the source or if WCSBIN_EOF is not specified. If the caller is not chaining buffers, then WCSBIN_EOF should be specified, and a simple check of whether *slen equals the original input buffer length on return is sufficient to determine whether an error occurred. If the caller is chaining buffers, WCSBIN_EOF is not specified and the caller must proceed with the copy-down / continued buffer loading loop to distinguish between an incomplete buffer and an error.
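For the non-chained case, the check on *slen can be sketched as a small predicate. strict_eof_valid() is a hypothetical helper, not part of libc. It assumes the whole buffer was passed in a single call with both WCSBIN_STRICT and WCSBIN_EOF set, and that the output limit was not what stopped the conversion. Testing the return value for (size_t)-1 is an extra precaution beyond the *slen comparison described above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libc).
 *   ret     - value returned by the conversion call
 *   len_in  - value of *slen before the call (input length)
 *   len_out - value of *slen after the call (elements processed)
 * Returns 1 if the whole input was accepted, 0 otherwise.
 */
static int
strict_eof_valid(size_t ret, size_t len_in, size_t len_out)
{
    if (ret == (size_t)-1)
        return 0;               /* error at the base of the source */
    assert(len_out <= len_in);  /* cannot process more than was supplied */
    return len_out == len_in;   /* a short return means an error follows */
}
```

Note that the return value counts output elements, so it is normally smaller than the input length for multi-byte input. Only the comparison of the two *slen values indicates whether everything was consumed.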

RETURN VALUES

The mbintowcr(), mbintowcr_l(), utf8towcr(), wcrtombin(), wcrtombin_l() and wcrtoutf8() functions return the number of output elements generated and set *slen to the number of input elements converted. If an error occurs but the output buffer has already been populated, a short return will occur and the next iteration where the error is the first element will return the error. The caller is responsible for processing any error conditions before continuing.

The mbintowcr(), mbintowcr_l() and utf8towcr() functions can return a (size_t)-1 error if WCSBIN_STRICT is specified, and otherwise cannot.

The wcrtombin(), wcrtombin_l() and wcrtoutf8() functions can return a (size_t)-1 error if given an illegal wchar code, as modified by flags. Any wchar code >= 0x80000000U always causes an error to be returned.

ERRORS

If an error is returned, errno will be set to EILSEQ.

STANDARDS

The mbintowcr(), mbintowcr_l(), utf8towcr(), wcrtombin(), wcrtombin_l() and wcrtoutf8() functions are non-standard extensions to libc.
