Gentoo Wiki Archives

UTF-8


About

Because computers store information only in bits of zeros and ones, characters have to be represented by a string of bits and translated back and forth using "character tables". To conserve memory, each character should be made up of as few bits as possible.

The drawback is that this limits the number of characters that can be represented by the table. As long as the table contains all the characters you need, there are no problems. The moment one shares a file with someone who uses a different character table, things start going wrong.

Some tables (such as the ISO-8859-* tables) overlap, with the same bit string representing the same character in each of them. Other characters exist in only one of the tables. These, naturally, are the main point of contention.

There are two solutions to this problem. Either one must have information about the character table used in each file that contains text, or have a table that incorporates each and every character in the world.

Unicode is an implementation of the latter. It allows users to write and exchange information without compatibility worries, and with falling storage prices it has become very popular. Users only have to make sure that their software supports Unicode and that they have fonts installed that can display all the characters they wish to use (no single font implements all the characters in Unicode).
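
As a small illustration of the character table problem described above, the following sketch decodes one and the same byte (0xE4) with two different ISO-8859 tables and prints the resulting UTF-8 bytes in hex. It assumes that the iconv command-line tool is installed and knows ISO-8859-1 and ISO-8859-7; the function name decode_byte_e4 is made up for this example.

```shell
# Decode the single byte 0xE4 (octal 344) using the character table named
# in $1, re-encode it as UTF-8, and print the UTF-8 bytes as hex digits.
# decode_byte_e4 is a hypothetical helper name, not an existing tool.
decode_byte_e4() {
    printf '\344' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

decode_byte_e4 ISO-8859-1   # Latin-1: the byte is U+00E4 (a with diaeresis)
decode_byte_e4 ISO-8859-7   # Greek:   the byte is U+03B4 (small delta)
```

The two calls should print different byte sequences (c3a4 and ceb4), which is exactly why a file has to carry, or agree on, its character table.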

USE flags

Add the USE flags unicode and nls to your /etc/make.conf:

File: /etc/make.conf
 USE="... nls unicode ..."
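
To double-check that both flags really ended up in USE, something like the following sketch may help. It assumes that your make.conf is simple enough to be sourced by a POSIX shell, which is not guaranteed for every setup; has_use_flag is a hypothetical helper name.

```shell
# has_use_flag FILE FLAG: succeed if FLAG is listed in the USE variable
# that FILE sets. Runs in a subshell so nothing leaks into the caller.
# Assumption: FILE can be sourced by sh (true for simple make.conf files).
has_use_flag() {
    (
        file=$1 flag=$2 USE=
        set -f                                  # keep entries like "-*" literal
        . "$file" >/dev/null 2>&1 || exit 1     # unreadable file counts as "not set"
        for f in $USE; do
            [ "$f" = "$flag" ] && exit 0
        done
        exit 1
    )
}

for flag in nls unicode; do
    if has_use_flag /etc/make.conf "$flag"; then
        echo "$flag: set"
    else
        echo "$flag: missing"
    fi
done
```

A disabled flag such as "-unicode" is deliberately not counted as set, because only exact matches are accepted.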

To rebuild all changed packages, do a world upgrade:

emerge world --update --newuse

Kernel Stuff

To activate Unicode support in the kernel, set the following options:

Linux Kernel Configuration: Unicode support
 File systems --->
  Native Language Support --->
    (utf8) Default NLS Option
    <*>   NLS UTF8

After you recompile your kernel, filenames will be encoded in UTF-8 by default.

If you compiled it as a module, be sure to load it:

# modprobe nls_utf8

To avoid doing this every time you boot, add "nls_utf8" to your /etc/modules.autoload.d/kernel-2.6 (or kernel-2.4) file.
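
If you are unsure whether NLS UTF8 was built into the kernel or compiled as a module, you can look at the kernel configuration file. The following is only a sketch: CONFIG_NLS_UTF8 is the kernel symbol behind the "NLS UTF8" menu entry, while the function name, the messages and the path /usr/src/linux/.config are assumptions that you may have to adjust.

```shell
# nls_utf8_state CONFIG_FILE: report how NLS UTF8 is configured in a
# kernel .config file. The function name and messages are made up.
nls_utf8_state() {
    if grep -q '^CONFIG_NLS_UTF8=y' "$1"; then
        echo "built-in"
    elif grep -q '^CONFIG_NLS_UTF8=m' "$1"; then
        echo "module"
    else
        echo "disabled"
    fi
}

# Assumed location of the kernel configuration; adjust to your kernel tree.
config=/usr/src/linux/.config
if [ -r "$config" ]; then
    nls_utf8_state "$config"
fi
```

Only the "module" case requires the modprobe call and the autoload entry described above.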

Kernel Bugs

Please note that there exists a bug in some Linux kernel versions which affects UTF-8 locales using dead keys. The issue has reportedly been solved since kernel version 2.6.11.

Installing locales

See Locales

Console setup

To put the console into Unicode mode on login, add the following to ~/.bashrc (use "unicode_start foo_font" to set a custom font):

File: ~/.bashrc
if [[ $TERM = "linux" ]]; then
  unicode_start
fi

On a multi-user system, you need to do this for every user.

However, since "unicode_start" requires root privileges, you can instead configure your Gentoo system to default to Unicode consoles for all logins. For this to work, you need a recent version of sys-apps/baselayout (>=sys-apps/baselayout-1.11.9).

First, change the Unicode setting in /etc/rc.conf:

File: /etc/rc.conf
UNICODE="yes"

Mind the case. UNICODE="YES" will NOT work.

Then install Terminus, a good font for UTF-8 consoles:

Code: emerge terminus
emerge -av media-fonts/terminus-font

Also edit the following files, according to their comments:

/etc/conf.d/consolefont
/etc/conf.d/keymaps

You change the font in /etc/conf.d/consolefont.

File: /etc/conf.d/consolefont
CONSOLEFONT=LatArCyrHeb-16  # Latin, Arabic (only isolated forms), Cyrillic, Hebrew
# take a look at /usr/bin/unicode_start (shell script)

Here are the settings for the German keyboard:

File: /etc/conf.d/keymaps
KEYMAP="de-latin1"
#alternatively: KEYMAP="de-latin1-nodeadkeys"

With baselayout, you must no longer use "-u" in KEYMAP.

One example of a console font setting is:

File: /etc/conf.d/consolefont
CONSOLEFONT="ter-v16b"
#CONSOLETRANSLATION=""


Now reboot the system. The init system will automatically enable UTF-8 capability on all console logins. However, a particular console login won't actually display UTF-8 until it receives a switch-to-Unicode escape sequence.

The last step is to make the following change so that the switch-to-Unicode escape sequence is sent at each login:

File: ~/.bash_profile
if test -t 1 -a -t 2 ; then
        echo -n -e '\033%G'
fi

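This code sends the switch-to-Unicode escape sequence when both standard output and standard error are attached to a terminal. Note that "test -t" alone does not tell a console TTY apart from a terminal emulator or a remote shell; the $TERM check shown in the ~/.bashrc example above does. This code block is taken directly from the internals of the "unicode_start" command.

Or, to make the switch to UTF-8 global for all users (this could be problematic):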

File: /etc/profile
if test -t 1 -a -t 2 ; then
        echo -n -e '\033%G'
fi

As a last-ditch alternative, you can use this init.d script to set all consoles into Unicode mode at boot:

File: /etc/init.d/unicode
#!/sbin/runscript
conf=/etc/env.d/02locale

# Using devfs?
if [ -e /dev/.devfsd ] || [ -e /dev/.udev -a -d /dev/vc ]; then
  device=/dev/vc/
else
  device=/dev/tty
fi

depend() {
        need localmount
        after keymaps
        before consolefont
}

checkconfig() {
  # Read the locale settings and extract the encoding (the part after the dot)
  if [ -r ${conf} ]; then
          . ${conf}
          encoding=
          [ -n "${LC_ALL}" ]      && encoding=${LC_ALL#*.}      && return 0
          [ -n "${LC_MESSAGES}" ] && encoding=${LC_MESSAGES#*.} && return 0
          [ -n "${LANG}" ]        && encoding=${LANG#*.}        && return 0
  fi
  eend 1 "Locale is not configured. Please fix ${conf}"
  return 1
}

start() {
        ebegin "Setting consoles to UTF-8"
        # Stop here if no locale is configured
        checkconfig || return 1
        if [[ "${encoding}" =~ [uU][tT][fF]-?8 ]]; then
                # Reload the current keymap in Unicode mode, then send the
                # switch-to-Unicode escape sequence to every console
                dumpkeys | loadkeys --unicode
                for ((i=1; i <= RC_TTY_NUMBER; i++)); do
                        echo -ne "\033%G" > "${device}${i}"
                done
                eend 0
        else
                eend 1 "UTF-8 is not required"
        fi
}

Make the script executable:

Code: to make script executable
chmod +x /etc/init.d/unicode

and then add it to the default runlevel:

Code: add the script
rc-update add unicode default

Sometimes you may also need to set the LC_ALL and LANG environment variables; the Gentoo Linux Localization Guide explains how to do this.
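
The init script above decides whether UTF-8 is wanted by cutting the encoding out of the locale name with the ${VAR#*.} expansion. The following sketch shows that step in isolation, in plain POSIX shell, so you can check what your own settings evaluate to. The function names locale_encoding and is_utf8_locale are hypothetical.

```shell
# locale_encoding VALUE: print the part after the first dot, for example
# "UTF-8" for "en_US.UTF-8". Prints nothing if VALUE contains no dot.
locale_encoding() {
    case $1 in
        *.*) printf '%s\n' "${1#*.}" ;;
        *)   : ;;
    esac
}

# is_utf8_locale VALUE: succeed if the encoding part is utf8 or utf-8,
# in any letter case.
is_utf8_locale() {
    case $(locale_encoding "$1") in
        [uU][tT][fF]-8*|[uU][tT][fF]8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Same order of precedence as the script: LC_ALL, LC_MESSAGES, then LANG.
current=${LC_ALL:-${LC_MESSAGES:-${LANG:-}}}
if is_utf8_locale "$current"; then
    echo "UTF-8 locale: $current"
else
    echo "not a UTF-8 locale: ${current:-unset}"
fi
```

Unlike the bare expansion used in the script, this variant treats a value without a dot (such as "C") as having no encoding at all.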

Converting old files

Once Unicode support has been added, old files may need to be re-encoded to display properly.

To re-encode the contents of plain text files, you can choose between iconv, recode, and enconv (which is in app-i18n/enca).
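
As a sketch of the iconv route, the following helper converts a file in place, but only if it does not already decode cleanly as UTF-8. It assumes that iconv and mktemp are available; both function names are made up for this example, and you still have to know the original encoding yourself.

```shell
# is_valid_utf8 FILE: succeed if FILE already decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# reencode_to_utf8 FROM FILE: convert FILE from encoding FROM to UTF-8 in
# place, unless it is valid UTF-8 already. Hypothetical helper name.
reencode_to_utf8() {
    if is_valid_utf8 "$2"; then
        echo "skipped (already valid UTF-8): $2"
        return 0
    fi
    tmp=$(mktemp) || return 1
    if iconv -f "$1" -t UTF-8 "$2" >"$tmp"; then
        cat "$tmp" >"$2"    # overwrite the contents, keep the file's permissions
        rm -f "$tmp"
        echo "converted: $2"
    else
        rm -f "$tmp"        # conversion failed, leave the original untouched
        return 1
    fi
}
```

A call could look like "reencode_to_utf8 ISO-8859-15 notes.txt", where notes.txt is just a placeholder file name. The validity check makes the helper safe to run twice, since converting an already converted file would garble it. Keep a backup anyway.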

app-text/convmv is a perl script utility that re-encodes filenames, directory names, and entire subtrees. Emerge it with

Code:
emerge -av app-text/convmv

To test re-encoding a filename from ISO-8859-15 to UTF-8, try

Code:
convmv -f iso-8859-15 -t utf8 file-name-with-

and if the produced command seems sane, add --notest to actually re-encode the name.

Applications

To enter Unicode characters that are not available on your keyboard, hold down Ctrl+Shift and type the hex value nnnn of the character. Note: use the value from the Unicode notation U+nnnn, not the UTF-8 encoded value.
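
The difference between the U+nnnn value and the UTF-8 encoded value can be made visible with a short sketch. It assumes that iconv supports UTF-32BE; the function name utf8_bytes_of is hypothetical.

```shell
# utf8_bytes_of HEX: print the UTF-8 encoding (as hex digits) of the
# code point U+HEX. Assumes iconv knows UTF-32BE; hypothetical helper.
utf8_bytes_of() {
    cp=$((0x$1))
    # Build the 4-byte big-endian UTF-32 form as octal escapes for printf.
    fmt=$(printf '\\%03o\\%03o\\%03o\\%03o' \
        $(( (cp >> 24) & 255 )) $(( (cp >> 16) & 255 )) \
        $(( (cp >> 8) & 255 ))  $(( cp & 255 )))
    # $fmt is used as the format string on purpose so the escapes expand.
    printf "$fmt" | iconv -f UTF-32BE -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_bytes_of 20AC   # euro sign, U+20AC
```

For the euro sign you would type 20AC, not the three bytes (e282ac) that the function prints.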

Terminal emulators

xterm

xterm runs in Unicode mode when started with either of:

Code:
xterm -u8
uxterm

If you want xterm to support Unicode without starting it with the "-u8" parameter, you can also add this to your ~/.Xresources:

Code: xterm Unicode
XTerm*locale: true

After having added this line, you need to run xrdb -merge ~/.Xresources.

URxvt

URxvt from x11-terms/rxvt-unicode always runs in Unicode mode. If you want it to use UTF-8, you have to set your LANG accordingly (e.g. LANG="en_US.UTF-8").

GNU Screen

GNU Screen must be invoked with the -U command line option.

If you are using it as a login shell, you will have to write a wrapper that calls screen with the -U option in addition to the options that are used when screen acts as a login shell:

Code: GNU Screen wrapper
#!/bin/sh
exec /usr/bin/screen -xRR -U

For people using it for irssi and so on, making an alias is enough.

File: ~/.bashrc
alias screen="screen -U"

However, if you are running screen from an SSH or RSH session, then editing the screen configuration should be enough.

Add the following to ~/.screenrc

File: ~/.screenrc
defutf8 on

Players

XMMS

XMMS isn't able to handle UTF-8 characters. A replacement is the Beep-Media-Player. It's a GTK v2.0-based XMMS-clone which supports Unicode.

emerge -av beep-media-player

Of course there are many themes and plugins for BMP (Beep Media Player):

emerge -s bmp

Another way to get XMMS running is to create a GTK v1 configuration file: gtkrc.utf8. We copy an existing one (/etc/gtk/gtkrc.iso-8859-14) to /etc/gtk/gtkrc.utf8

 cp /etc/gtk/gtkrc.iso-8859-14 /etc/gtk/gtkrc.utf8

Now we need to edit it and replace every occurrence of "iso-8859-14" with "utf8":

nano /etc/gtk/gtkrc.utf8

Afterwards it should look like:

style "gtk-default-utf8" {
      fontset = "-*-helvetica-medium-r-normal--12-*-*-*-*-*-utf8,\
                 -*-arial-medium-r-normal--12-*-*-*-*-*-utf8,\
                 -*-helvetica-medium-r-normal--12-*-*-*-*-*-utf8,\
                 -*-arial-medium-r-normal--12-*-*-*-*-*-utf8,*-r-*"
}
class "GtkWidget" style "gtk-default-utf8"

Now XMMS shouldn't have any problems displaying UTF-8 encoded characters.

Editors

vim

Unicode should work out of the box, since version 6.3. To make vim display files in UTF-8, add this to your .vimrc:

File: ~/.vimrc
set enc=utf-8

Add this

File: ~/.vimrc
set fenc=utf-8

to make vim use UTF-8 when writing files.

If your terminal is also using UTF-8, add this:

File: ~/.vimrc
set termencoding=utf-8

nano

Versions prior to 1.3.6 can't handle UTF-8 properly. At the time of writing, the keyword change below is only needed on the alpha and ppc-macos platforms.

Code:
echo "=app-editors/nano-1.3.6 ~alpha" >> /etc/portage/package.keywords
emerge -uavD nano

Emacs

When run in console mode, Emacs can be configured to handle Unicode by adding the following Lisp instructions to its configuration file:

File: ~/.emacs
 (setq locale-coding-system 'utf-8)
 (set-terminal-coding-system 'utf-8)
 (set-keyboard-coding-system 'utf-8)
 (set-selection-coding-system 'utf-8)
 (prefer-coding-system 'utf-8)

Notice, however, that the console must handle Unicode too.

KWrite/Kate

Start KWrite/Kate and go to Settings -> Configure Kate -> Editor Component -> Open/Save. Select UTF-8 under "Encoding". (Tested on KDE 4.1 SVN.)

LaTeX

Merge Unicode support for LaTeX with

Code:
emerge dev-tex/latex-unicode

News reader

slrn

slrn needs either an ISO-8859-1 based console or luit:

Code:
LC_ALL=de_DE.iso88591 LANG=en_US.iso88591 xterm -fn "-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1" -e slrn
Code:
LC_ALL=de_DE.iso88591 LANG=en_US.iso88591 luit slrn

Mail

KMail

Open KMail and go to Settings -> Configure KMail -> Composer -> Charset. There you will find a list of charsets that are checked from top to bottom until a suitable one is found. Move "utf-8" to the first position. Then go to Settings -> Configure KMail -> Appearance -> Message Window and, in the "Fallback message encoding" list, select "Unicode (UTF-8)".

Mutt printing

Mutt should work flawlessly on a Unicode console. But if you want to use pretty-printing, you need a few tricks, as a2ps does not support UTF-8. Your best bet may be app-misc/muttprint, as it seems to work perfectly in both Unicode and single-byte environments and produces very elegant output. However, it requires LaTeX to be installed on your system.

Emerge the package and put this in your ~/.muttrc

File: ~/.muttrc
set print_command=muttprint

Otherwise you may emerge recode and a2ps:

emerge recode a2ps

and use this in

File: ~/.muttrc
set print_command="recode UTF-8..Latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy -P printername"

You may also use u2ps from the gnome-u2ps package, which has native Unicode support. (It is packaged for Debian; whether it is also available in Gentoo is not known.)

/bin/mail

mail-client/mailx is not able to handle UTF-8; mail-client/nail is.

Code:
emerge --unmerge mailx
emerge nail

You can use /bin/mail of mail-client/mailx to send a UTF-8 encoded mail but you have to set the "charset"-header manually (tested with v8.1.2.20050715-r1):

Code:
echo "" | mail -a "Content-Type: text/plain; charset=utf-8" -s "${subject}" ${recipient}

Sylpheed-Claws printing

For printing with Sylpheed, we need to use a2ps and recode.

Code:
emerge recode a2ps
Code:
Print command:
cat %s | recode ..latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy -P <printername>

Substitute "<printername>" with the name of your printer.

If you're using KDE:

Code:
Print command:
cat %s | recode ..latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy | kprinter --stdin

Shells

bash

Bash has been Unicode-aware since version 2.05b, provided it uses readline version 4.3. Both are in Portage.

emerge bash sys-libs/readline
revdep-rebuild --soname libreadline.so.4
rm /lib/libreadline.so.4*

Be sure you know what you are doing when you perform the last step (see the info from the readline ebuild).

You will also need to have the package gentoolkit installed as it contains the revdep-rebuild tool.

Warning: the manual deletion of libreadline.so.4 recommended above should be double-checked before you run it. On a system with readline 5 installed:

# qfile /lib/libreadline.so.4
sys-libs/readline (/lib/libreadline.so.4)
# eix -s readline
sys-libs/readline-5.2_p12-r1

Apparently, libreadline.so.4 can belong to readline-5*. This can be verified further with:

# qlist readline

Some of the configuration changes recommended in this article may no longer be needed; much of this should already be covered by /etc/rc.conf and the unicode USE flag.

zsh

Zsh has handled UTF-8 perfectly since version 4.3.1. Older versions are not Unicode-aware, but they still work as long as you don't use Backspace on Unicode characters. (Doing so deletes parts of the UTF-8 character bytewise and confuses zle's assumptions about the cursor position.)

mc

Mc must be compiled with the sys-libs/slang library for full Unicode support.

emerge gentoolkit
euse -E slang
emerge -avDN mc

X

Applications such as Fluxbox and Sylpheed-Claws might cause problems when they are not merged with the cjk USE flag: the affected applications take a long time to start and consume a lot of CPU. Meanwhile, there are release candidates of Fluxbox 1.0 with better UTF-8 support.

X usually obeys the LC_* environment variables; however, X is picky about how you spell your locale settings. What works in the console may not work in X. You can find a list of all acceptable locale aliases in /usr/lib/X11/locale/locale.alias. As always, case matters. You should also make sure that the locale you choose corresponds to one of the glibc locales listed by "locale -a".

If you're doing advanced troubleshooting you may also be interested in the locale.dir file, in the same directory. It maps locale names to files. Make sure it maps your locale correctly (it usually does).

So to sum it up, the chain goes like this, and all of its links must be intact: LC_* -> locale.alias -> locale.dir -> [X locale definition file]
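
The glibc end of this chain can be checked from a shell. The sketch below only tests whether a locale name shows up in the output of "locale -a"; it does not look at locale.alias or locale.dir. Because "locale -a" typically prints names such as en_US.utf8 while LANG is usually written as en_US.UTF-8, the comparison ignores case and hyphens. Both function names are hypothetical.

```shell
# normalize_locale NAME: lower-case NAME and drop hyphens, so that
# "en_US.UTF-8" and "en_US.utf8" compare equal.
normalize_locale() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -d '-'
}

# locale_available NAME: succeed if glibc lists NAME in "locale -a".
locale_available() {
    locale -a 2>/dev/null | tr 'A-Z' 'a-z' | tr -d '-' |
        grep -qxF -- "$(normalize_locale "$1")"
}

wanted=${LANG:-C}
if locale_available "$wanted"; then
    echo "glibc knows $wanted"
else
    echo "glibc does not list $wanted; check your generated locales"
fi
```

Keep in mind that this normalization is more forgiving than X itself, which expects the exact spelling found in locale.alias.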

Microsoft Windows partitions

If you're using VFAT partitions, you need to modify the mount options.

File: /etc/fstab
/dev/hdxY        /mnt/windows1        vfat        iocharset=utf8,codepage=850        0 0
/dev/hdxY        /mnt/windows2        ntfs        nls=utf8                           0 0
//samba2/share   /mnt/windows3        smbfs       iocharset=utf8,codepage=cp850      0 0
//samba3/share   /mnt/windows4        smbfs       iocharset=utf8                     0 0
/dev/cdrom       /media/cdrom         udf,iso9660 iocharset=utf8,ro,user,noauto      0 0

There are differences between Samba v2.2 (DOS, Microsoft Windows 9x and Microsoft Windows Millennium) and Samba v3 (Microsoft Windows 2000 and Microsoft Windows XP). See http://us5.samba.org/samba/docs/man/Samba3-HOWTO/unicode.html

Note: VFAT requires the codepage to be given as 850, whereas smbfs requires cp850. You can also set these values in the kernel configuration, so that HAL can use them as well.

Samba

Disable the following option in the kernel, if set:

File systems --->
  Network File Systems --->
    <M> SMB file system support (to mount Windows shares etc.)
      [ ] Use a default NLS 

Microsoft Windows NT/2000/2003/XP can handle UTF-8, but DOS and Microsoft Windows 9x or Millennium clients need this option enabled, because those operating systems cannot handle UTF-8 and need cp850.

Alternatively you can use CIFS:

File systems --->
  Network File Systems --->
    <M> CIFS support (advanced network filesystem for Samba, Window and other CIFS compliant servers)

Fluxbox

Fluxbox doesn't fully support Unicode yet. Some of its styles select fonts that are not suitable for Unicode. To fix this, edit the Fluxbox style file(s) in /usr/share/fluxbox/styles and add something like:

File: /usr/share/fluxbox/styles/$YourStyle
window.font:                         -*-*-*-*-*-*-*-*-*-*-*-*-*-u

to at least fix the window title bug.

Solution by user Holms:

Another solution is to set the locale in ~/.xinitrc. For example, I use Cyrillic most of the time. If you put this in your ~/.xinitrc

File: ~/.xinitrc
export LANG="ru_RU.UTF-8"
export LC_ALL="ru_RU.UTF-8"

then all window titles will be in Unicode and your locale will be Russian; set this to match your country. It may be wiser to use an English locale such as en_US.UTF-8 instead, because otherwise all programs will display everything in your language instead of English. The UTF-8 suffix tells the system which encoding to use by default. You can also add the same two lines to ~/.bashrc (some people prefer to do this, but it didn't help me), and do not forget to configure your locales in /etc/locale.gen. If you haven't configured them yet, read about locales in the Gentoo Handbook. If this doesn't help, read HOWTO Xorg and Fonts and do everything in its "Emerging the necessary packages" section; at least that helped me.

OpenOffice.org

To force OpenOffice.org to use UTF-8 (otherwise you'll have problems when entering Unicode characters), set the LANGUAGE variable to an appropriate value:

File: /etc/env.d/02locale
LANG="de_DE.UTF-8"
...a lot of LC-Variables...
# For OpenOffice.org
LANGUAGE="en_GB:en"

Don't forget to run env-update && source /etc/profile after changing files in /etc/env.d/. You may need to log in again to apply the changes to your current environment.


Last modified: Sun, 07 Sep 2008 12:30:00 +1000 Hits: 111,122
