Merge pull request #1 from kynx/fix-word-separation

Fix word separation with snake_case
kynx · Oct 9, 2022 · 557ff32
2 parents 8066ba5 + 84df929
commit 557ff32

Showing 8 changed files with 228 additions and 110 deletions.
diff --git a/README.md b/README.md
@@ -7,18 +7,18 @@ Utilities for generating PHP code.
 
 ## Normalizers
 
-The normalizers generate PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings, 
+The normalizers generate readable PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings, 
 [transliterating] them to ASCII and spelling out any invalid characters.
 
 ### Usage:
 
-The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Shop"):
+The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Store"):
 ```php
 <?php
 
 use Kynx\CodeUtls\ClassNameNormalizer;
 
-$normalizer = new ClassNameNormalizer();
+$normalizer = new ClassNameNormalizer('Controller');
 $namespace = $normalizer->normalize('ペット \ ショップ');
 echo $namespace;
 ```
@@ -48,23 +48,34 @@ See the [tests] for more examples.
 
 ### Why?
 
-You should never generate code from untrusted user input. But there are a few cases where you may want to do it with 
-mostly-trusted input. In my case, it's generating classes and properties from an OpenAPI specification, where there are
-no restrictions on the characters present. 
+You must **never** run code generated from untrusted user input. But there are a few cases where you do want to 
+_output_ code generated from (mostly) trusted input.
+
+In my case, I need to generate classes and properties from an OpenAPI specification. There are no hard-and-fast rules
+on the characters present, just a vague "it is RECOMMENDED to follow common programming naming conventions". Whatever 
+they are. 
 
 ### How?
 
-`AbstractNormalizer` uses `ext-intl`'s [Transliterator] to perform the transliteration. Where a character has no 
+Each normalizer uses `ext-intl`'s [Transliterator] to turn the UTF-8 string into Latin-ASCII. Where a character has no 
 equivalent in ASCII (the "€" symbol is a good example), it uses the [Unicode name] of the character to spell it out (to 
-`Euro`). For ASCII characters that are not valid in a PHP label, it provides it's own spell outs: for instance, a 
-backtick "`" becomes `Backtick`.
+`Euro`, after some minor clean-up). For ASCII characters that are not valid in a PHP label, it provides its own spell 
+outs. For instance, a backtick "&#96;" becomes `Backtick`.
+
+Initial digits are also spelt out: "123foo" becomes `OneTwoThreeFoo`. Finally reserved words are suffixed with a 
+user-supplied string so they don't mess things up. In the first usage example above, if we normalized "class" it would 
+become `ClassController`.
+
+The results may not be pretty. If for some mad reason your input contains ` ͖`  - put your glasses on! - the label will 
+contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a chance of being as unique as 
+the original. But speaking of which...
 
-Initial digits are also spelt out - "123 foo" becomes `OneTwoThreeFoo` - and finally reserved words are suffixed with a 
-user-supplied string so they don't mess things up: "class" can become `ClassController`.
+### Uniqueness
 
-The results may not be pretty. For instance, if your input contains ` ͖`  - put your glasses on! - the class name will 
-contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a good chance of being as unique 
-as the original.  
+The normalization process reduces around a million Unicode code points down to just 162 ASCII characters. Then we mangle 
+it further by stripping separators, reducing whitespace and turning it into camelCase, snake_case or whatever 
+your programming preference. It's gonna be lossy - nothing we can do about that. Ideally this library would provide a
+utility for guaranteeing uniqueness across a set of labels, but I haven't written it yet. Feel free to contribute!
 
 
 [transliterating]: https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
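
The two README rules about leading digits and reserved words can be tried in isolation. What follows is a minimal sketch, not the library's code: `DIGITS`, `RESERVED`, `spellOutLeadingDigits()` and `suffixReserved()` are hypothetical stand-ins with deliberately short tables, and the input to `suffixReserved()` is assumed to be already cased.

```php
<?php

declare(strict_types=1);

// Hypothetical, trimmed-down tables; the library's own lists are longer.
const DIGITS   = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine'];
const RESERVED = ['class', 'function', 'return'];

// Spell out every digit at the start of the label, then capitalise the rest.
function spellOutLeadingDigits(string $label): string
{
    $words  = [];
    $i      = 0;
    $length = strlen($label);
    while ($i < $length && ctype_digit($label[$i])) {
        $words[] = DIGITS[(int) $label[$i]];
        $i++;
    }

    return implode('', $words) . ucfirst(substr($label, $i));
}

// Append the suffix only when the label collides with a reserved word.
function suffixReserved(string $label, string $suffix): string
{
    return in_array(strtolower($label), RESERVED, true) ? $label . $suffix : $label;
}

echo spellOutLeadingDigits('123foo'), PHP_EOL;       // OneTwoThreeFoo
echo suffixReserved('Class', 'Controller'), PHP_EOL; // ClassController
```

The reserved-word check compares in lower case because PHP keywords are case-insensitive, which matches the `strtolower()` call in `sanitizeReserved()` further down the diff.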

diff --git a/phpunit.xml.dist b/phpunit.xml.dist
@@ -23,4 +23,10 @@
             <directory suffix=".php">src</directory>
         </include>
     </coverage>
+
+    <php>
+        <!-- Seems to be needed by CI's PHP8.2-RC1? Not needed in PHP8.2-dev locally! -->
+        <ini name="assert.exception" value="1" />
+        <ini name="assert.warning" value="0" />
+    </php>
 </phpunit>
diff --git a/src/AbstractNormalizer.php b/src/AbstractNormalizer.php
@@ -9,10 +9,11 @@
 use IntlCodePointBreakIterator;
 use Transliterator;
 
+use function array_filter;
 use function array_map;
 use function array_shift;
+use function array_slice;
 use function assert;
-use function count;
 use function explode;
 use function implode;
 use function in_array;
@@ -27,7 +28,6 @@
 use function strtolower;
 use function substr;
 use function trim;
-use function ucfirst;
 
 /**
  * Utility for generating valid PHP labels from UTF-8 strings
@@ -132,67 +132,66 @@ abstract class AbstractNormalizer implements NormalizerInterface
     ];
 
     private const ASCII_SPELLOUT = [
-        1   => 'StartOfHeader',
-        2   => 'StartOfText',
-        3   => 'EndOfText',
-        4   => 'EndOfTransmission',
+        1   => 'Start Of Header',
+        2   => 'Start Of Text',
+        3   => 'End Of Text',
+        4   => 'End Of Transmission',
         5   => 'Enquiry',
         6   => 'Acknowledgement',
         7   => 'Bell',
         8   => 'Backspace',
-        9   => 'HorizontalTab',
-        10  => 'LineFeed',
-        11  => 'VerticalTab',
-        12  => 'FormFeed',
-        13  => 'CarriageReturn',
-        14  => 'ShiftOut',
-        15  => 'ShiftIn',
-        16  => 'DataLinkEscape',
-        17  => 'DeviceControlOne',
-        18  => 'DeviceControlTwo',
-        19  => 'DeviceControlThree',
-        20  => 'DeviceControlFour',
-        21  => 'NegativeAcknowledgement',
-        22  => 'SynchronousIdle',
-        23  => 'EndOfTransmissionBlock',
+        9   => 'Horizontal Tab',
+        10  => 'Line Feed',
+        11  => 'Vertical Tab',
+        12  => 'Form Feed',
+        13  => 'Carriage Return',
+        14  => 'Shift Out',
+        15  => 'Shift In',
+        16  => 'Data Link Escape',
+        17  => 'Device Control One',
+        18  => 'Device Control Two',
+        19  => 'Device Control Three',
+        20  => 'Device Control Four',
+        21  => 'Negative Acknowledgement',
+        22  => 'Synchronous Idle',
+        23  => 'End Of Transmission Block',
         24  => 'Cancel',
-        25  => 'EndOfMedium',
+        25  => 'End Of Medium',
         26  => 'Substitute',
         27  => 'Escape',
-        28  => 'FileSeparator',
-        29  => 'GroupSeparator',
-        30  => 'RecordSeparator',
-        31  => 'UnitSeparator',
-        32  => 'Space',
+        28  => 'File Separator',
+        29  => 'Group Separator',
+        30  => 'Record Separator',
+        31  => 'Unit Separator',
         33  => 'Exclamation',
-        34  => 'DoubleQuote',
+        34  => 'Double Quote',
         35  => 'Number',
         36  => 'Dollar',
         37  => 'Percent',
         38  => 'Ampersand',
         39  => 'Quote',
-        40  => 'OpenBracket',
-        41  => 'CloseBracket',
+        40  => 'Open Bracket',
+        41  => 'Close Bracket',
         42  => 'Asterisk',
         43  => 'Plus',
         44  => 'Comma',
-        46  => 'FullStop',
+        46  => 'Full Stop',
         47  => 'Slash',
         58  => 'Colon',
         59  => 'Semicolon',
-        60  => 'LessThan',
+        60  => 'Less Than',
         61  => 'Equals',
-        62  => 'GreaterThan',
-        63  => 'QuestionMark',
+        62  => 'Greater Than',
+        63  => 'Question Mark',
         64  => 'At',
-        91  => 'OpenSquare',
+        91  => 'Open Square',
         92  => 'Backslash',
-        93  => 'CloseSquare',
+        93  => 'Close Square',
         94  => 'Caret',
         96  => 'Backtick',
-        123 => 'OpenCurly',
+        123 => 'Open Curly',
         124 => 'Pipe',
-        125 => 'CloseCurly',
+        125 => 'Close Curly',
         126 => 'Tilde',
         127 => 'Delete',
     ];
@@ -252,30 +251,36 @@ protected function toAscii(string $string): string
         return $this->spellOutNonAscii(implode(' ', $words));
     }
 
-    protected function separatorsToUnderscore(string $string): string
+    protected function separatorsToSpace(string $string): string
     {
-        return preg_replace('/[' . $this->separators . '\s]+/', '_', trim($string));
+        return preg_replace('/[' . $this->separators . '\s_]+/', ' ', trim($string));
     }
 
     protected function spellOutAscii(string $string): string
     {
-        $chunks = str_split($string);
-        $last   = count($chunks) - 1;
-        foreach (str_split($string) as $i => $char) {
-            if (isset(self::ASCII_SPELLOUT[ord($char)])) {
-                $char = self::ASCII_SPELLOUT[ord($char)] . ($i < $last ? '_' : '');
+        $speltOut = [];
+        $current  = '';
+
+        foreach (str_split($string) as $char) {
+            $ord = ord($char);
+            if (! isset(self::ASCII_SPELLOUT[$ord])) {
+                $current .= $char;
+                continue;
             }
-            $chunks[$i] = $char;
+
+            $speltOut[] = $current;
+            $speltOut[] = self::ASCII_SPELLOUT[$ord];
+            $current    = '';
         }
+        $speltOut[] = $current;
 
-        return $this->spellOutLeadingDigits(implode('', $chunks));
+        return $this->spellOutLeadingDigits(implode(' ', $speltOut));
     }
 
     protected function toCase(string $string): string
     {
-        assert(in_array($this->case, self::VALID_CASES));
-
-        $parts = explode('_', $string);
+        /** @var list<string> $parts */
+        $parts = array_filter(explode(' ', $string));
         return match ($this->case) {
             self::CAMEL_CASE  => $this->toCamelCase($parts),
             self::PASCAL_CASE => $this->toPascalCase($parts),
@@ -284,11 +289,11 @@ protected function toCase(string $string): string
         };
     }
 
-    protected function sanitizeReserved(string $string, array $reserved): string
+    protected function sanitizeReserved(string $string): string
     {
         assert($this->suffix !== null);
 
-        if (in_array(strtolower($string), $reserved, true)) {
+        if (in_array(strtolower($string), self::RESERVED, true)) {
             return $string . $this->suffix;
         }
         return $string;
@@ -297,10 +302,10 @@ protected function sanitizeReserved(string $string, array $reserved): string
     private function prepareSuffix(string|null $suffix, string $case): string|null
     {
         if ($suffix === null) {
-            return $suffix;
+            return null;
         }
 
-        if ($suffix === '' || ! preg_match('/^[a-zA-Z0-9_\x80-\xff]*$/', $suffix)) {
+        if (! preg_match('/^[a-zA-Z0-9_\x80-\xff]+$/', $suffix)) {
             throw NormalizerException::invalidSuffix($suffix);
         }
 
@@ -312,46 +317,53 @@ private function prepareSuffix(string|null $suffix, string $case): string|null
 
     private function spellOutNonAscii(string $string): string
     {
-        $speltOut = '';
+        $speltOut = [];
+        $current  = '';
 
         $this->codePoints->setText($string);
         /** @var string $char */
         foreach ($this->codePoints->getPartsIterator() as $char) {
-            $ord       = IntlChar::ord($char);
-            $speltOut .= $ord < 256 ? $char : $this->spellOutNonAsciiChar($ord);
+            $ord = IntlChar::ord($char);
+            if ($ord < 256) {
+                $current .= $char;
+                continue;
+            }
+
+            $speltOut[] = $current;
+            $speltOut[] = $this->spellOutNonAsciiChar($ord);
+            $current    = '';
         }
+        $speltOut[] = $current;
 
-        return $speltOut;
+        return implode(' ', $speltOut);
     }
 
     private function spellOutNonAsciiChar(int $ord): string
     {
         $speltOut = IntlChar::charName($ord);
 
-        // 'EURO SIGN' -> 'Euro'
-        return implode('', array_map(function (string $part): string {
-            return $part === 'SIGN' ? '' : ucfirst(strtolower($part));
-        }, explode(" ", $speltOut)));
+        // 'EURO SIGN' -> 'euro'
+        return implode(' ', array_map(function (string $part): string {
+            return $part === 'SIGN' ? '' : strtolower($part);
+        }, explode(' ', $speltOut)));
     }
 
     private function spellOutLeadingDigits(string $string): string
     {
-        $chunks = str_split($string);
+        $speltOut = [];
+        $chunks   = str_split($string);
         foreach ($chunks as $i => $char) {
-            if ($i > 1 && $char === '_') {
-                $chunks[$i] = '';
-                break;
-            }
-
             $ord = ord($char);
+
             if (! isset(self::DIGIT_SPELLOUT[$ord])) {
+                $speltOut[] = implode('', array_slice($chunks, $i));
                 break;
             }
 
-            $chunks[$i] = self::DIGIT_SPELLOUT[$ord] . '_';
+            $speltOut[] = self::DIGIT_SPELLOUT[$ord];
         }
 
-        return implode('', $chunks);
+        return implode(' ', $speltOut);
     }
 
     /**
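
The core of this change is that spelt-out characters are now collected as space-separated words and the empty pieces are dropped when casing is applied. A standalone sketch of that idea follows, assuming a tiny hypothetical `SPELLOUT` table and free functions `spellOut()` and `toPascalCase()` in place of the class methods:

```php
<?php

declare(strict_types=1);

// Hypothetical subset of the spell-out table, keyed by ASCII code.
const SPELLOUT = [36 => 'Dollar', 46 => 'Full Stop', 64 => 'At'];

// Collect runs of ordinary characters and spelt-out names as separate words.
function spellOut(string $string): string
{
    $words   = [];
    $current = '';
    foreach (str_split($string) as $char) {
        $ord = ord($char);
        if (! isset(SPELLOUT[$ord])) {
            $current .= $char;
            continue;
        }

        $words[] = $current;       // may be '' when two symbols are adjacent
        $words[] = SPELLOUT[$ord];
        $current = '';
    }
    $words[] = $current;

    return implode(' ', $words);
}

// Split on spaces, discard empty pieces, capitalise each word.
function toPascalCase(string $string): string
{
    $parts = array_filter(
        explode(' ', $string),
        static fn (string $part): bool => $part !== ''
    );

    return implode('', array_map('ucfirst', $parts));
}

echo toPascalCase(spellOut('a.b')), PHP_EOL; // AFullStopB
echo toPascalCase(spellOut('$$')), PHP_EOL;  // DollarDollar
```

One design point worth checking in review: `array_filter()` without a callback removes every falsy value, including the string `"0"`, so a standalone `0` word would be dropped by the `toCase()` shown in the diff. The sketch passes an explicit callback so that only empty strings are removed.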

diff --git a/src/ClassNameNormalizer.php b/src/ClassNameNormalizer.php
@@ -36,11 +36,11 @@ public function normalize(string $label): string
 
     private function normalizeLabel(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
-        $cased       = $this->toCase($speltOut);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
+        $cased    = $this->toCase($speltOut);
 
-        return $this->sanitizeReserved($cased, self::RESERVED);
+        return $this->sanitizeReserved($cased);
     }
 }
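
The first step after transliteration, `separatorsToSpace()`, is what fixes snake_case input: underscores are now treated like any other separator. A rough sketch of that step on its own, assuming `-.` as the configured separators and escaping them with `preg_quote()` (the diff does not show how `$this->separators` is prepared, so that part is a guess):

```php
<?php

declare(strict_types=1);

// Collapse any run of separators, whitespace or underscores into one space.
function separatorsToSpace(string $string, string $separators = '-.'): string
{
    // Assumption: separators are escaped here; the library may do this elsewhere.
    $class = preg_quote($separators, '/');

    return (string) preg_replace('/[' . $class . '\s_]+/', ' ', trim($string));
}

echo separatorsToSpace('pet_shop-owner  id'), PHP_EOL; // pet shop owner id
```

Because every separator ends up as a single space, the later steps only need to split on one character.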
diff --git a/src/ConstantNameNormalizer.php b/src/ConstantNameNormalizer.php
@@ -22,11 +22,11 @@ public function __construct(
      */
     public function normalize(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
-        $cased       = $this->toCase($speltOut);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
+        $cased    = $this->toCase($speltOut);
 
-        return $this->sanitizeReserved($cased, self::RESERVED);
+        return $this->sanitizeReserved($cased);
     }
 }
diff --git a/src/PropertyNameNormalizer.php b/src/PropertyNameNormalizer.php
@@ -21,9 +21,9 @@ public function __construct(string $case = self::CAMEL_CASE, string $separators
      */
     public function normalize(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
 
         return $this->toCase($speltOut);
     }