Ticket #38189

${#var} behaviour does not change along with locale

Opened: 2018-04-12 06:33 Last Update: 2018-04-15 11:07

Reporter:
Owner:
Type:
Status: Closed
Component:
Milestone: (None)
Priority: 5 - Medium
Severity: 5 - Medium
Resolution: Rejected
File: None

Details

Assuming a UTF-8 locale:

$ yash -c 'v=é; echo ${#v}; LC_CTYPE=C; echo ${#v}'
1
1

Since that is a two-byte character, I'd expect the second '1' to be '2', like in bash, ksh93 and zsh:

$ bash -c 'v=é; echo ${#v}; LC_CTYPE=C; echo ${#v}'
1
2

I think that behaviour is more useful as it allows counting bytes by temporarily setting LC_ALL=C.

I'm not sure whether to file this under "Bugs" or "Feature Requests". I'm leaning towards "Bugs" as it seems to me that ${#var} should count the characters in the currently set locale, which is not necessarily the same locale that was set when the shell was initialised.

Ticket History (3/4 Histories)

2018-04-12 06:33 Updated by: mcdutchie
  • New Ticket "${#var} behaviour does not change along with locale" created
2018-04-14 12:39 Updated by: magicant
  • Details Updated
Comment

In terms of POSIX conformance, the current behaviour is expected. Quoting the standard:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_05_03

Changing the value of LC_CTYPE after the shell has started shall not affect the lexical processing of shell commands in the current shell execution environment or its subshells. Invoking a shell script or performing exec sh subjects the new shell to the changes in LC_CTYPE.
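A minimal sketch of what that passage implies in practice: a locale change takes effect in a newly invoked shell, so ${#v} can be evaluated there instead of in the current shell. The helper name len_in_locale is hypothetical, and "sh" stands in for whichever shell is being tested (for example yash). What a given shell prints for non-ASCII input under the C locale is not guaranteed.

```shell
# Hypothetical helper: evaluate ${#v} in a fresh shell started under locale $1.
# The assignments before "sh" are placed in the child's environment only;
# the current shell's locale is left untouched.
len_in_locale() {
    v=$2 LC_ALL=$1 sh -c 'echo "${#v}"'
}

# ASCII input gives the same length in any locale.
len_in_locale C abc
```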

As an exception, yash reflects changes of LC_CTYPE when it is interactive and not in the POSIXly correct mode. Some users have to set LC_CTYPE in their .yash_profile or .yashrc because the system might not be configured to invoke the shell with the LC_CTYPE value they want.

Regarding character processing, yash handles almost all strings as "wide character sequences" rather than "byte sequences". LC_CTYPE affects how characters are converted between wide characters and bytes. However, changing LC_CTYPE in interactive yash does not convert characters between the locales, so there is no guarantee that characters are handled as expected after changing LC_CTYPE. To convert characters between locales, you need to use a dedicated converter like "iconv".
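As a rough illustration of the iconv approach, the sketch below converts a string from UTF-8 to ISO-8859-1 and counts the resulting bytes. The helper name to_latin1 is hypothetical, and the encoding names UTF-8 and ISO-8859-1 are assumptions: POSIX leaves the accepted names to the implementation, though common iconv implementations recognise both. It also assumes the shell was started in a UTF-8 locale.

```shell
# Hypothetical helper: convert a UTF-8 string to ISO-8859-1 with iconv.
# The encoding names are assumptions about the local iconv implementation.
to_latin1() {
    printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1
}

v=é
# Count the bytes of the converted string; "é" is a single byte in ISO-8859-1.
to_latin1 "$v" | wc -c
```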

Bytes are converted to wide characters even in the C locale, so it is important to note that not all bytes are accepted in the C locale. Typically non-ASCII characters are rejected or ignored as illegal. You should not rely on the C locale to count the number of raw bytes in a string.
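One way to count bytes without switching to the C locale is to pipe the string to wc -c, which counts bytes rather than characters. This is only a sketch: byte_len is a hypothetical helper name, and it assumes the shell was started in a locale (such as UTF-8) in which it can hold the string in the first place.

```shell
# Hypothetical helper: count the bytes in a string via wc -c
# instead of relying on ${#var} under LC_ALL=C.
byte_len() {
    # printf '%s' writes the string without a trailing newline.
    # The arithmetic expansion strips the padding that some wc
    # implementations put around the number.
    echo "$(( $(printf '%s' "$1" | wc -c) ))"
}

v=é
byte_len "$v"   # 2 when "é" is encoded as UTF-8
```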

2018-04-15 01:53 Updated by: mcdutchie
Comment

Thanks for the information! I withdraw my report.

2018-04-15 11:07 Updated by: magicant
  • Resolution Update from None to Rejected
  • Status Update from Open to Closed
