Regex unfolding accented/special non-UTF-8 chars to ASCII-7 equivalents

Sorry, I meant folding, not unfolding.

If you put certain characters, like "o=" or "u=", into a UTF-8 HTML document (only ü and ö are in the UTF-8 table), it works for general purposes. I thought they got encoded in the HTML as their numeric HTML entity. I would like to unfold these, but I failed miserably.

Let’s take “u=” as an example (a Hungarian character: u with a double acute accent). Envato can’t even process it :D See, that’s how important it would be to get this working.

The only way it works is when I look for ű. I don’t know what that is called, I can’t reproduce how I got it, and I don’t want to look these up one by one. I only know I have seen these kinds of characters when there was an encoding error, mostly in SQL tables, so it’s some kind of last-resort entity reference, I have no idea…

        $string = preg_replace( 
        $string = html_entity_decode($string);
The string is UTF-8 encoded by default. Before any further processing and accent unfolding, this code strips the “u=” and converts it to “u”. But that handles only this one character. What am I missing here?

        $string = preg_replace( 
This is close to what I use. Just an example.

I thought I’d just write a regex search for the unnamed characters, referencing them by their numbers. The numbers.


This pattern would match all the “u=”-like characters from Ũ to ų, including “u=”, which is number ű – but it didn’t work. How could I use this pattern? It seems like I can only use it for UTF-8 characters.
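For reference, here is a minimal sketch of the number-range idea, assuming the characters really do arrive as decimal numeric references (&#360; through &#371;) in the string. The helper name `fold_u_references` is hypothetical, and the 360 to 371 range is taken from the Ũ to ų span mentioned above. Instead of putting the range into the regex itself, it captures the number and compares it in a callback:

```php
<?php
// Hypothetical helper, not from the original post.
// Folds decimal numeric references in the range 360..371 (U+0168..U+0173)
// to a plain ASCII "U" or "u" and leaves every other reference untouched.
function fold_u_references(string $html): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            $code = (int) $m[1];
            if ($code >= 360 && $code <= 371) {
                // In this block the even code points are uppercase
                // and the odd ones are lowercase.
                return ($code % 2 === 0) ? 'U' : 'u';
            }
            return $m[0]; // not in the range: keep the reference as-is
        },
        $html
    );
}
```

Calling `fold_u_references('t&#369;z')` should give `tuz`, while something like `&#931;` (sigma) passes through unchanged.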


(This is what I use, like in the example above.) This would match a lot of things (basically everything), including all numeric references and also some non-UTF-8 but named entities like Σ. When told to replace with the $1 backreference, it would return S (the first char), but no joy here either. I also tried it with sigma’s numeric reference, no luck.

What can I do? I don’t want to just strip them, the number range pattern solution would be so convenient! Um.. do I smell hex in the air?
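On the hex hunch, a sketch that might be worth trying: decode the references into real UTF-8 characters first, then match by code point with `\x{...}` and the `/u` pattern modifier. This assumes the input is valid UTF-8 and a PHP version with `ENT_HTML5` (5.4 or later). The function name `fold_u_characters` is made up for illustration, and only the Ũ to ų block is covered:

```php
<?php
// Hypothetical helper, not from the original post.
function fold_u_characters(string $text): string
{
    // Turn numeric and named references into actual UTF-8 characters.
    $decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // which allows \x{...} code points above 0xFF in a character class.
    // Uppercase letters of the block (U+0168, U+016A, ... U+0172):
    $decoded = preg_replace(
        '/[\x{0168}\x{016A}\x{016C}\x{016E}\x{0170}\x{0172}]/u',
        'U',
        $decoded
    );

    // Lowercase letters of the block (U+0169, U+016B, ... U+0173):
    return preg_replace(
        '/[\x{0169}\x{016B}\x{016D}\x{016F}\x{0171}\x{0173}]/u',
        'u',
        $decoded
    );
}
```

Without the `/u` modifier the same class would be rejected or matched byte by byte, which may be why a plain range only seemed to work for some characters.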