Sorting UTF-8 strings in PHP

Created: 29 Apr 2017 16:41:28, in Web development

When sorting a list of strings in PHP, one first needs a reliable way of comparing them. PHP has some proven comparison functions, notably strcmp. strcmp compares strings byte by byte, which works fine for the ASCII character set, where every symbol fits in seven bits and byte order matches the order of the English alphabet. Unfortunately, it cannot be relied upon for strings in variable-width encodings such as UTF-8, where a non-ASCII letter takes two or more bytes and its byte values say nothing about its place in a language's alphabet.
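To make the problem concrete, here is a minimal sketch with three illustrative words (they are not taken from any particular application). It assumes the source file itself is saved as UTF-8.

```php
<?php
// This file is assumed to be saved as UTF-8.
$words = array('zebra', 'łyżka', 'lampa');

// strcmp compares raw bytes: 'l' is 0x6C, 'z' is 0x7A, 'ł' is 0xC5 0x82.
usort($words, 'strcmp');

print_r($words); // byte order: lampa, zebra, łyżka
```

A Polish dictionary would list łyżka between lampa and zebra, because ł directly follows l in the Polish alphabet.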

UTF-8 has been the de facto standard encoding on the web for some time. Thanks to its fast-growing popularity and wide adoption over the last couple of years, ASCII-based encodings like ISO 8859-2 are no longer the only acceptable option for building websites in languages other than English.

On the downside, because UTF-8 is a variable-width encoding, string operations such as comparisons have become a wee bit more difficult. One now has to pay attention to character collations to perform them successfully.

Collation-aware string comparison

The strcmp function in PHP pays no attention to character collations. Fortunately, PHP provides another function, strcoll, which also compares strings but does so according to the LC_COLLATE value of the current locale.

LC_COLLATE is a locale category. It determines character collation order, something many functions and utilities that work on strings rely upon heavily. A locale is a group of parameters describing the user's preferred language and region, and possibly some extra preferences for the user interface.

Other notable categories in a locale include LC_CTYPE and LC_TIME. The former governs character classification and case conversion, and with them the character encoding in use, like 8-bit ISO-8859-1 (256 characters encoded) or 7-bit ASCII (128 characters encoded). The latter configures the format of dates and times.

Getting and setting LC_COLLATE value

On a Linux distribution like Debian, type man locale in your terminal to obtain more information about locales, or locale -av for a detailed listing of the locales installed on your system.

Sometimes LC_COLLATE is set to C (in this default mode, collation follows strict byte-value order), which is fine for the ASCII character set but not for a multi-byte encoding like UTF-8.
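The effect of the C locale can be seen in a short sketch. Under C, strcoll falls back to plain byte comparison, so it agrees with strcmp. The two words are illustrative only, and the file is assumed to be saved as UTF-8.

```php
<?php
// The C locale is always available, so this call does not depend on installed locales.
setlocale(LC_COLLATE, 'C');

$byBytes  = strcmp('łyżka', 'zebra');
$byLocale = strcoll('łyżka', 'zebra');

// Both results are positive: 'ł' (bytes 0xC5 0x82) compares greater than 'z' (0x7A).
var_dump($byBytes > 0, $byLocale > 0);
```

In other words, calling strcoll alone is not enough. It only becomes collation-aware once LC_COLLATE names a locale that matches the language and encoding of the data.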

As far as PHP is concerned, LC_COLLATE can be read and set using the setlocale function.

Finding current LC_COLLATE locale value:


echo setlocale(LC_COLLATE, '0'); // "0" leaves the locale unchanged and returns the current value
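The same "0" trick works for the other categories mentioned earlier. Below is a small sketch; the helper name currentLocaleSettings is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper: reads a few locale categories without changing them.
function currentLocaleSettings(): array
{
    $categories = array(
        'LC_COLLATE' => LC_COLLATE,
        'LC_CTYPE'   => LC_CTYPE,
        'LC_TIME'    => LC_TIME,
    );
    $settings = array();
    foreach ($categories as $name => $category) {
        // Passing "0" returns the current value and leaves the locale untouched.
        $settings[$name] = setlocale($category, '0');
    }
    return $settings;
}

foreach (currentLocaleSettings() as $name => $value) {
    echo $name, ' = ', $value, PHP_EOL;
}
```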

Setting LC_COLLATE value with setlocale:


setlocale(LC_COLLATE, 'language_TERRITORY.codeset');

As for concrete examples, the correct LC_COLLATE value for English-language content for Great Britain encoded in UTF-8 is en_GB.UTF-8. Similarly, the LC_COLLATE value for Polish-language content for Poland is pl_PL.UTF-8. Note that setlocale returns false when the requested locale is not installed on the system.
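Since a locale may be missing or spelled differently from one system to another, it is worth checking what setlocale returns. The sketch below passes several candidate names at once, and setlocale uses the first one that works. The helper name useCollationLocale is an assumption made for this example.

```php
<?php
// Hypothetical helper: sets LC_COLLATE to the first available candidate
// and returns the name that was actually applied.
function useCollationLocale(string ...$candidates): string
{
    // Given an array, setlocale tries each entry in turn until one succeeds.
    $chosen = setlocale(LC_COLLATE, $candidates);
    if ($chosen === false) {
        throw new RuntimeException(
            'None of these locales is installed: ' . implode(', ', $candidates)
        );
    }
    return $chosen;
}
```

A call such as useCollationLocale('pl_PL.UTF-8', 'pl_PL.utf8') would then either return the name of the locale in effect or fail loudly, which is easier to debug than sorting that quietly falls back to byte order.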

Sorting examples

With the collation setting configured, one can start using the strcoll function for string comparisons and functions like usort (with strcoll as the callback) or sort (with the SORT_LOCALE_STRING flag) for sorting strings.

Here are some examples:

Case-sensitive sorting of strings in Polish language encoded in UTF-8:


setlocale(LC_COLLATE, 'pl_PL.UTF-8');
$PL = array('łyżka','Żeźnia','żebrak','grzegrzółka','Ósemka','2-mięsieczny źrebak');
usort($PL, 'strcoll');
// $PL => array('2-mięsieczny źrebak','grzegrzółka','łyżka','Ósemka','żebrak','Żeźnia')

Case-sensitive sorting of strings in German language encoded in UTF-8:


setlocale(LC_COLLATE, 'de_DE.UTF-8');
$DE = array('unglück','laßt','schönen','blühe','waschbär','schildkröte');
usort($DE, 'strcoll');
// $DE => array('blühe','laßt','schildkröte','schönen','unglück','waschbär')
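Locale-aware sorting is also possible without a callback: the built-in sort function accepts the SORT_LOCALE_STRING flag, which compares items according to the current locale. Here is a minimal sketch; the helper name sortedByCurrentLocale is made up for this example, and pl_PL.UTF-8 is assumed to be installed.

```php
<?php
// Hypothetical helper: returns a sorted copy and leaves the input untouched.
function sortedByCurrentLocale(array $strings): array
{
    // SORT_LOCALE_STRING compares items using the locale set with setlocale.
    sort($strings, SORT_LOCALE_STRING);
    return $strings;
}

// setlocale returns false when pl_PL.UTF-8 is not installed.
if (setlocale(LC_COLLATE, 'pl_PL.UTF-8') !== false) {
    print_r(sortedByCurrentLocale(array('żebrak', 'łyżka', 'grzegrzółka')));
}
```

Note that sort reindexes the array; if keys matter, asort takes the same flag.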

Using the PHP Collator class

As an alternative, collation-aware string comparisons and sorting can be carried out using the Collator class. It is not available unless the PHP internationalization extension (intl) has been installed, which on a Linux distribution like Debian GNU/Linux and many of its derivatives can be achieved with either pecl or, preferably, the apt-get utility.

Logged in as a privileged user, enter:


apt-get install php[your-php-version]-intl

Replace [your-php-version] with whatever version you need the internationalization module for. Once the installation is complete, the Collator class is available for both php-cli and Apache (you might need to reload your server configuration first) and ready to use.
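Before relying on it, one can check from PHP itself whether the extension is really loaded. This is a small sketch; the helper name collatorAvailable is an assumption made for this example.

```php
<?php
// Hypothetical helper: true when the intl extension and its Collator class are usable.
function collatorAvailable(): bool
{
    return extension_loaded('intl') && class_exists('Collator');
}

echo collatorAvailable()
    ? 'Collator is available' . PHP_EOL
    : 'intl is missing; install it or fall back to setlocale and strcoll' . PHP_EOL;
```

Run it both from the command line and through the web server, as the two can load different sets of extensions.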

Here is a quick example of comparing and sorting some Polish words with Collator.


$collator = new Collator('pl_PL');

// comparing
$result = $collator->compare('świerk', 'sosna');
// $result => 1

// sorting
$sortable = array('ściana','słowo','ćwikła','cena');
$collator->sort($sortable);

// new order of words in $sortable
// => array('cena','ćwikła','słowo','ściana')
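In practice the strings to sort often sit inside larger records. Collator's compare method can then serve as the comparison step of usort. The sketch below assumes the intl extension is loaded; the function name sortByName and the 'name' key are invented for this example.

```php
<?php
// Hypothetical helper: sorts rows by their 'name' field using the rules of $locale.
function sortByName(array $rows, string $locale): array
{
    $collator = new Collator($locale);
    usort($rows, function (array $a, array $b) use ($collator) {
        // compare returns 1, 0 or -1 (or false on failure)
        return $collator->compare($a['name'], $b['name']);
    });
    return $rows;
}

$rows = array(
    array('name' => 'ściana'),
    array('name' => 'słowo'),
    array('name' => 'ćwikła'),
    array('name' => 'cena'),
);

print_r(array_column(sortByName($rows, 'pl_PL'), 'name'));
// same order as above: cena, ćwikła, słowo, ściana
```

Unlike strcoll, Collator takes its rules from the locale passed to its constructor, so it does not depend on setlocale or on the locales installed on the system.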

Final thoughts

Overall, dealing with UTF-8 encoded strings in PHP is not without its problems. Comparisons and sorting, however, are easy to carry out successfully once collation is set right.

This post was updated on 09 Aug 2017 00:54:07

Tags: php, sort


Author, Copyright and citation

Author

Sylwester Wojnowski

The author of the above article, Sylwester Wojnowski, is the sWWW admin and owner. He enjoys doing Maths and studying algorithms, writing code in scripting and command languages, Thrash Metal music and playing electric guitar.

Copyrights

©Copyright, 2017 Sylwester Wojnowski. This article may not be reproduced or published as a whole or in parts without permission from the author. If you share it, please give author credit and do not remove embedded links.

Computer code, if present in the article, is excluded from the above and licensed under GPLv3.

Citation

Cite this article as:

Wojnowski, Sylwester. "Sorting UTF-8 strings in PHP." From sWWW - Code For The Web. https://wojnowski.net.pl//main/index/sorting-utf-8-strings-in-php