Sorting UTF-8 strings in PHP
Created: 29 Apr 2017 16:41:28
When sorting a list of strings in PHP, one first needs a reliable way of comparing them.
PHP has some nice comparison functions, notably strcmp. The strcmp function compares strings byte by byte, which works
fine for the ASCII character set, in which every symbol is encoded with the same number of bits (7),
but it can hardly be relied upon when comparing symbols from a variable-width encoding like UTF-8.
UTF-8 is both the recommended encoding and by far the most popular one on the web today, hence it is the one I'm most interested in too (for this article and as a long-standing user of the technology).
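To see why byte-wise comparison falls short, here is a minimal sketch. The word 'zebra' is an assumed sample chosen only for illustration, and the file is assumed to be saved as UTF-8. In the Polish alphabet ł comes well before z, yet strcmp puts 'łyżka' after 'zebra', because the first byte of 'ł' in UTF-8 (0xC5) is larger than the byte for 'z' (0x7A).

```php
<?php
// strcmp() compares strings byte by byte and knows nothing about alphabets.
// In UTF-8, 'ł' is encoded as the two bytes 0xC5 0x82, while 'z' is 0x7A.
$result = strcmp('łyżka', 'zebra');

// A positive result means the first string is considered "greater",
// i.e. 'łyżka' would be placed after 'zebra' when sorting.
var_dump($result > 0);
```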
Collation-aware string comparison
Since strcmp can't be reliably used for characters encoded in UTF-8, some other function is needed. PHP provides the strcoll function, which also compares strings but does so according to the LC_COLLATE value.
LC_COLLATE is a locale variable. It determines the character collation order, a feature that many functions and utilities working with strings rely upon heavily. A locale is a group of parameters describing the user's preferred language and region, and possibly some extra preferences for the user interface. Other notable locale variables include LC_CTYPE, which determines character classification and the encoding in use, for example 1-byte ISO-8859-1 (256 characters encoded) or the above-mentioned 7-bit ASCII (128 characters encoded), and LC_TIME, which determines how dates and times are formatted.
Getting and setting LC_COLLATE value
Type man locale in a terminal for more info about locales, and locale -av for a detailed listing of what's currently installed on your system.
LC_COLLATE is frequently set to C (under this default mode, collation is done in strict numeric order of byte values), which is fine for ASCII but not for a multi-byte encoding like UTF-8.
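The effect of the C locale can be sketched as follows. The sample words 'Zebra', 'apple' and 'zebra' are assumptions picked for illustration. Under C collation strcoll orders strings by byte value, so uppercase ASCII letters come before lowercase ones and any multi-byte UTF-8 character lands after all ASCII letters.

```php
<?php
// Switch collation to the "C" locale, which is always available.
setlocale(LC_COLLATE, 'C');

// 'Z' is 0x5A and 'a' is 0x61, so 'Zebra' sorts before 'apple'.
$zebraFirst = strcoll('Zebra', 'apple') < 0;

// 'ł' starts with byte 0xC5, which is larger than 'z' (0x7A),
// so 'łyżka' sorts after 'zebra'.
$accentLast = strcoll('łyżka', 'zebra') > 0;

var_dump($zebraFirst, $accentLast);
```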
As far as PHP is concerned, LC_COLLATE can be read and set using the setlocale function.
Finding current LC_COLLATE locale value:
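A minimal sketch of the lookup, assuming nothing beyond the standard setlocale function. The variable name $current is only illustrative.

```php
<?php
// Passing the string "0" as the locale asks setlocale() to report the
// current setting for the category without changing it.
$current = setlocale(LC_COLLATE, '0');

echo "Current LC_COLLATE: ", $current, PHP_EOL;
```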
Setting LC_COLLATE value with setlocale:
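A possible way to do it is sketched below. The second spelling, en_GB.utf8, is an assumption about how the same locale may be named on some systems. setlocale returns false when none of the requested locales is installed, so the result is worth checking.

```php
<?php
// setlocale() accepts several candidate names and applies the first one
// that is installed. 'en_GB.utf8' is an assumed alternative spelling.
$applied = setlocale(LC_COLLATE, 'en_GB.UTF-8', 'en_GB.utf8');

if ($applied === false) {
    // None of the candidates is installed; the previous setting stays active.
    echo "Requested locale is not available", PHP_EOL;
} else {
    echo "LC_COLLATE is now: ", $applied, PHP_EOL;
}
```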
As a concrete example, British English using UTF-8 is en_GB.UTF-8.
The LC_COLLATE value for the Polish language with UTF-8 encoding is pl_PL.UTF-8.
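With the collation setting configured, one can start using the strcoll function for string comparisons, and functions like usort (with strcoll as the comparison callback) or sort (with the SORT_LOCALE_STRING flag) for sorting strings. Note that natsort does not take the locale into account.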
Here are some examples:
Case-sensitive sorting of strings in Polish language encoded in UTF-8:
$PL = array('łyżka','Żeźnia','żebrak','grzegrzółka','Ósemka','2-mięsieczny źrebak');
usort($PL, 'strcoll');
=> array('2-mięsieczny źrebak','grzegrzółka','łyżka','Ósemka','żebrak','Żeźnia')
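For completeness, here is a self-contained sketch of the Polish example. The helper sortByLocale is hypothetical (it is not a PHP built-in), and pl_PL.utf8 is an assumed alternative spelling of the locale name. The helper gives up instead of sorting when the locale is missing, because the result would then silently fall back to whatever collation is currently active.

```php
<?php
// Hypothetical helper: sorts a copy of $strings with strcoll() under the
// first available locale from $locales. Returns null when none is installed.
function sortByLocale(array $strings, array $locales): ?array
{
    $previous = setlocale(LC_COLLATE, '0');   // remember the current setting
    if (setlocale(LC_COLLATE, $locales) === false) {
        return null;                          // locale missing, nothing sorted
    }
    usort($strings, 'strcoll');               // collation-aware comparison
    setlocale(LC_COLLATE, $previous);         // restore the earlier setting
    return $strings;
}

$PL = array('łyżka','Żeźnia','żebrak','grzegrzółka','Ósemka','2-mięsieczny źrebak');
$sorted = sortByLocale($PL, array('pl_PL.UTF-8', 'pl_PL.utf8'));

if ($sorted === null) {
    echo "Polish UTF-8 locale is not installed", PHP_EOL;
} else {
    print_r($sorted);
}
```

Restoring the previous LC_COLLATE value keeps the helper from affecting other code that runs later in the same process.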
Case-sensitive sorting of strings in German language encoded in UTF-8:
$DE = array('unglück','laßt','schönen','blühe','waschbär','schildkröte');
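The German list can be sorted in the same spirit. The sketch below uses sort with the SORT_LOCALE_STRING flag instead of usort, and de_DE.utf8 is an assumed alternative spelling of the locale name. For this particular list the result should be blühe, laßt, schildkröte, schönen, unglück, waschbär.

```php
<?php
$DE = array('unglück','laßt','schönen','blühe','waschbär','schildkröte');

// Try the German UTF-8 locale; 'de_DE.utf8' is an assumed alternative name.
// If neither is installed, the current locale stays in effect.
setlocale(LC_COLLATE, 'de_DE.UTF-8', 'de_DE.utf8');

// SORT_LOCALE_STRING makes sort() compare items according to the current locale.
sort($DE, SORT_LOCALE_STRING);

print_r($DE);
```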
Sorting UTF-8 encoded strings in PHP is not without its problems; nonetheless, as long as collation is configured, it gives the desired outcome.
This post was updated on 29 Apr 2017 17:08:13
php , sort