Sorting UTF-8 strings in PHP

Created: 29 Apr 2017 16:41:28, in Web development

Before sorting a list of strings in PHP, one needs a reliable way of comparing them. PHP has some nice comparison functions, notably strcmp. The strcmp function compares strings byte by byte, which works fine for the ASCII character set, where every symbol is encoded with the same number of bits (7), but it can hardly be relied upon when the symbols come from a variable-width encoding like UTF-8.
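To see why byte-wise comparison goes wrong, here is a small sketch. The sample words are my own choice for illustration, and the file is assumed to be saved as UTF-8. The Polish letter ł takes two bytes in UTF-8, both greater than any ASCII byte, so strcmp places it after z even though ł follows l in the Polish alphabet.

```php
<?php
// 'ł' (U+0142) is encoded in UTF-8 as the two bytes 0xC5 0x82.
$bytes  = bin2hex('ł');  // hex dump of the raw bytes
$length = strlen('ł');   // strlen counts bytes, not characters

// strcmp compares byte by byte: 0xC5 is greater than 0x7A ('z'),
// so 'łyżka' is considered greater than 'zebra'.
$result = strcmp('łyżka', 'zebra');

var_dump($bytes, $length, $result > 0);
```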

UTF-8 is both the recommended encoding and easily the most popular one on the web, hence the one I'm most interested in (for this article and as a long-standing user of the technology).

Collation-aware string comparison

Since strcmp can't be reliably used for characters encoded in UTF-8, some other function is needed. PHP provides the strcoll function, which also compares strings but does so according to the current LC_COLLATE value.

A locale is a group of parameters describing things like the user's preferred language and region, and possibly some extra user-interface preferences. LC_COLLATE is one of the locale categories. It determines the character collation order, a feature that many functions and utilities working with strings rely upon heavily. Other notable categories include LC_CTYPE, which governs character classification and the encoding in use, for example 1-byte ISO-8859-1 (256 characters encoded) or the above-mentioned 7-bit ASCII (128 characters encoded), and LC_TIME, which determines how dates and times are formatted.

Getting and setting LC_COLLATE value

Type man locale in a terminal for more information about locales, and locale -av for a detailed listing of the locales currently installed on your system.

LC_COLLATE is frequently set to C (under this default mode, collation is done in strict numeric order of the byte values), which is fine for ASCII but not for a multi-byte encoding like UTF-8.
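A quick way to check this, as a sketch: under the C locale, strcoll orders strings the same way strcmp does. The sample words are only an illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// The "C" locale is always available, so this call should not fail.
setlocale(LC_COLLATE, 'C');

// Under "C", strcoll uses plain byte order, just like strcmp:
// 'ł' (bytes 0xC5 0x82) sorts after 'z' (byte 0x7A).
$byStrcoll = strcoll('łyżka', 'zebra');
$byStrcmp  = strcmp('łyżka', 'zebra');

var_dump($byStrcoll > 0, $byStrcmp > 0);
```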

As far as PHP is concerned, LC_COLLATE can be read and set using the setlocale function.

Finding current LC_COLLATE locale value:
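A minimal sketch of the lookup: passing the string "0" as the locale makes setlocale return the current setting without changing it. The variable name is mine.

```php
<?php
// "0" tells setlocale to leave the setting alone and only report it.
$current = setlocale(LC_COLLATE, '0');

echo $current, PHP_EOL;
```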


Setting LC_COLLATE value with setlocale:
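A sketch of the call, assuming the en_GB.UTF-8 locale is installed. The second name, en_GB.utf8, is an alternative spelling I added as a fallback. setlocale accepts several candidates and returns the name it actually applied, or false if none is available, so the return value is worth checking.

```php
<?php
// Try each candidate in turn; the first one the system knows is applied.
$applied = setlocale(LC_COLLATE, 'en_GB.UTF-8', 'en_GB.utf8');

if ($applied === false) {
    // None of the requested locales is installed on this system.
    echo 'Locale not available', PHP_EOL;
} else {
    echo 'LC_COLLATE is now ', $applied, PHP_EOL;
}
```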


As a concrete example, the LC_COLLATE value for British English with UTF-8 encoding is en_GB.UTF-8, and for the Polish language with UTF-8 encoding it is pl_PL.UTF-8.

Sorting examples

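With the collation setting configured, one can start using the strcoll function for string comparisons, passing it as the comparison callback to a function like usort. (natsort takes no callback and does not consult the locale, so it is not a substitute here.)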

Here are some examples:

Case-sensitive sorting of Polish strings encoded in UTF-8:

$PL = array('łyżka','Żeźnia','żebrak','grzegrzółka','Ósemka','2-mięsieczny źrebak');
setlocale(LC_COLLATE, 'pl_PL.UTF-8');
usort($PL, 'strcoll');
// => array('2-mięsieczny źrebak','grzegrzółka','łyżka','Ósemka','żebrak','Żeźnia')

Case-sensitive sorting of German strings encoded in UTF-8:

$DE = array('unglück','laßt','schönen','blühe','waschbär','schildkröte');
setlocale(LC_COLLATE, 'de_DE.UTF-8');
usort($DE, 'strcoll');
// => array('blühe','laßt','schildkröte','schönen','unglück','waschbär')
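The examples above can be wrapped in a small helper that copes with a missing locale. This is only a sketch: sortWithLocale is a hypothetical name, not a PHP built-in, and the .utf8 spelling is a fallback I added. It returns null rather than silently sorting in byte order when none of the requested locales is installed.

```php
<?php
/**
 * Sort strings with strcoll under the first available locale.
 * sortWithLocale is a hypothetical helper, not part of PHP.
 * Returns null when none of the candidate locales is installed.
 */
function sortWithLocale(array $strings, array $candidates): ?array
{
    $previous = setlocale(LC_COLLATE, '0');   // remember the current setting
    if (setlocale(LC_COLLATE, $candidates) === false) {
        return null;                          // no candidate is available
    }
    usort($strings, 'strcoll');               // locale-aware comparison
    setlocale(LC_COLLATE, $previous);         // restore the earlier setting
    return $strings;
}

$PL = array('łyżka', 'Żeźnia', 'żebrak', 'grzegrzółka', 'Ósemka', '2-mięsieczny źrebak');
$sorted = sortWithLocale($PL, array('pl_PL.UTF-8', 'pl_PL.utf8'));

print_r($sorted ?? 'pl_PL.UTF-8 is not installed');
```

Restoring the previous LC_COLLATE value keeps the helper from changing collation for the rest of the script.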

Final thoughts

Sorting UTF-8 encoded strings in PHP is not without its problems; nonetheless, as long as collation is configured, it gives the desired outcome.

This post was updated on 29 Apr 2017 17:08:13

Tags: php, sort