[sugj-tech:7415] Fixing Samba's internal character encoding (Re: CH_DISPLAY and gettext)

Fri, 24 Jun 2011 02:51:42 JST

This is Motonobu Takahashi.

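As I also wrote briefly in my diary, a discussion has at long last
started on the upstream Samba side about switching Samba's internal
character encoding from one specified by the unix charset parameter
to a fixed encoding.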

- [Samba] Reignited after seven years!? The Samba internal character encoding debate
 <http://damedame.monyo.com/?date=20110624#p01>

Whether it should be UTF-8 or UTF-16 is still open to debate, but
either way I believe a fixed internal character encoding would be a
major benefit.

That said, the question has only just been raised, so I think it is
still far from clear where the discussion will end up.

So I would be very grateful for everyone's support.

From: Michael Adam <obnox ＠ samba.org>
Subject: Re: CH_DISPLAY and gettext
Date: Thu, 23 Jun 2011 15:04:27 +0200

> I have some points of criticism with CH_UNIX used as charset to
> internally store strings (file names, user names, etc) in memory
> as well as in databases. I am sure that there have been very good
> reasons for introducing CH_UNIX as internal encoding in the past,
> but I am questioning this anyways:
> 
> 1) This loses information too early!
>    The mapping Unicode --> CH_UNIX is potentially lossy.
>    E.g. if I use ASCII or some latin/iso charset, then some characters
>    will not be displayable. Maybe even unmarshalling will fail
>    so users will not be available, depending on the value of CH_UNIX.
> 
> 2) Storing our internal databases (s3 eg: group mapping, passdb)
>    in CH_UNIX is a very bad thing: This encoding might be changed
>    by the administrators and the databases are not converted
>    automatically. Neither is the file system but there is convmv
>    for this. But for the internal DBs there is not even a
>    conversion tool. I have to look which other databases are
>    stored in which encoding, especially samba4.
> 
>    I have been in quite cumbersome manual db repair due to this
>    problem more than once already. This was really bad!
> 
> In order to fix #2, there are two options:
> 
> a) Change the dbs (individually) to convert from internal
>    representation to UTF8 (or UTF16 maybe), before storing.
> 
> b) change samba to internally store everything in UTF8
>    and then write out the DBs unchanged.
>    For every target that needs a special encoding (like
>    the file system needing CH_UNIX), we'd then need to convert
>    before accessing the target (like I detailed in my
>    previous emails).
> 
> In either case we also need an encoding conversion tool for each
> such database, since afaik we can not reliably autodetect
> the encoding of the stored data.
> 
> In order to fix #1 though, option (b) is the only possible way.
> 
> 
> So my wish would be to convert all of samba to use UTF8
> internally (I'd be ready to discuss a different unicode
> charset like UTF16), and convert to CH_UNIX for the necessary
> communication interfaces with the outside.
> 
> 
> I hope this makes my argument a little clearer.
> 
> Cheers - Michael
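
The two problems in the quoted message can be illustrated with a small
sketch. This is not Samba code: Python's codecs stand in for the
Unicode <-> CH_UNIX conversion, and the helper names to_unix_charset
and read_stored are hypothetical, made up only for this example.

```python
def to_unix_charset(text: str, charset: str) -> bytes:
    """Convert Unicode text to an assumed 'unix charset', replacing every
    character the charset cannot represent with '?' (hypothetical helper)."""
    return text.encode(charset, errors="replace")


def read_stored(data: bytes, assumed_charset: str) -> str:
    """Decode stored bytes under the charset the reader believes was used
    (hypothetical helper)."""
    return data.decode(assumed_charset)


# Point 1: the Unicode -> unix charset mapping can lose information.
name = "日本語ファイル.txt"
lossy = to_unix_charset(name, "iso-8859-1")
assert read_stored(lossy, "iso-8859-1") != name  # the original name is gone
# A Unicode encoding such as UTF-8 round-trips the same name intact.
assert read_stored(to_unix_charset(name, "utf-8"), "utf-8") == name

# Point 2: bytes written under one charset are misread once the
# configured charset changes, and nothing in the bytes records which
# charset was used to write them.
stored = to_unix_charset("グループ", "euc_jp")
# Decoding as ISO-8859-1 raises no error, yet yields the wrong text.
assert read_stored(stored, "iso-8859-1") != "グループ"
try:
    read_stored(stored, "utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

Because ISO-8859-1 accepts any byte sequence, a successful decode says
nothing about whether the guess was right, which is consistent with the
remark that the stored encoding cannot be reliably autodetected.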

---
TAKAHASHI Motonobu <monyo ＠ monyo.com> / @damemonyo
  http://damedame.monyo.com/ / http://facebook.com/monyot