Merge pull request #1 from kynx/fix-word-separation
Fix word separation with snake_case
kynx authored Oct 9, 2022
2 parents 8066ba5 + 84df929 commit 557ff32
Showing 8 changed files with 228 additions and 110 deletions.
39 changes: 25 additions & 14 deletions README.md
@@ -7,18 +7,18 @@ Utilities for generating PHP code.
 
 ## Normalizers
 
-The normalizers generate PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings,
+The normalizers generate readable PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings,
 [transliterating] them to ASCII and spelling out any invalid characters.
 
 ### Usage:
 
-The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Shop"):
+The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Store"):
 ```php
 <?php
 
 use Kynx\CodeUtls\ClassNameNormalizer;
 
-$normalizer = new ClassNameNormalizer();
+$normalizer = new ClassNameNormalizer('Controller');
 $namespace = $normalizer->normalize('ペット \ ショップ');
 echo $namespace;
 ```
@@ -48,23 +48,34 @@ See the [tests] for more examples.
 
 ### Why?
 
-You should never generate code from untrusted user input. But there are a few cases where you may want to do it with
-mostly-trusted input. In my case, it's generating classes and properties from an OpenAPI specification, where there are
-no restrictions on the characters present.
+You must **never** run code generated from untrusted user input. But there are a few cases where you do want to
+_output_ code generated from (mostly) trusted input.
+
+In my case, I need to generate classes and properties from an OpenAPI specification. There are no hard-and-fast rules
+on the characters present, just a vague "it is RECOMMENDED to follow common programming naming conventions". Whatever
+they are.
 
 ### How?
 
-`AbstractNormalizer` uses `ext-intl`'s [Transliterator] to perform the transliteration. Where a character has no
+Each normalizer uses `ext-intl`'s [Transliterator] to turn the UTF-8 string into Latin-ASCII. Where a character has no
 equivalent in ASCII (the "€" symbol is a good example), it uses the [Unicode name] of the character to spell it out (to
-`Euro`). For ASCII characters that are not valid in a PHP label, it provides it's own spell outs: for instance, a
-backtick "`" becomes `Backtick`.
+`Euro`, after some minor clean-up). For ASCII characters that are not valid in a PHP label, it provides its own spell
+outs. For instance, a backtick "&#96;" becomes `Backtick`.
 
+Initial digits are also spelt out: "123foo" becomes `OneTwoThreeFoo`. Finally reserved words are suffixed with a
+user-supplied string so they don't mess things up. In the first usage example above, if we normalized "class" it would
+become `ClassController`.
+
+The results may not be pretty. If for some mad reason your input contains ` ͖` - put your glasses on! - the label will
+contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a chance of being as unique as
+the original. But speaking of which...
+
-Initial digits are also spelt out - "123 foo" becomes `OneTwoThreeFoo` - and finally reserved words are suffixed with a
-user-supplied string so they don't mess things up: "class" can become `ClassController`.
+### Uniqueness
 
-The results may not be pretty. For instance, if your input contains ` ͖` - put your glasses on! - the class name will
-contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a good chance of being as unique
-as the original.
+The normalization process reduces around a million Unicode code points down to just 162 ASCII characters. Then we mangle
+it further by stripping separators, reducing whitespace and turning it into camelCase, snake_case or whatever
+your programming preference. It's gonna be lossy - nothing we can do about that. Ideally this library would provide a
+utility for guaranteeing uniqueness across a set of labels, but I haven't written it yet. Feel free to contribute!
 
 
 [transliterating]: https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
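
Reviewer note: the last two steps described in the README's "How?" section (spelling out leading digits, then suffixing reserved words) can be sketched in plain PHP without `ext-intl`. This is only an illustration under stated assumptions, not the library's code: `DIGIT_WORDS`, `RESERVED_WORDS`, and the three helper functions are hypothetical names invented for this note, and the reserved-word list is a tiny subset.

```php
<?php

declare(strict_types=1);

// Hypothetical digit table for this sketch; the library keeps its own DIGIT_SPELLOUT constant.
const DIGIT_WORDS = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine'];

// Illustrative subset only; a real reserved-word list is much longer.
const RESERVED_WORDS = ['class', 'function', 'new', 'return'];

/** Spell out the digits at the start of $label as space-separated words. */
function spellOutLeadingDigits(string $label): string
{
    $words  = [];
    $length = strlen($label);
    $i      = 0;
    while ($i < $length && ctype_digit($label[$i])) {
        $words[] = DIGIT_WORDS[(int) $label[$i]];
        $i++;
    }
    // Whatever follows the leading digits is kept as one trailing word.
    $words[] = substr($label, $i);

    return implode(' ', array_filter($words, static fn (string $w): bool => $w !== ''));
}

/** Join space-separated words as PascalCase. */
function toPascalCase(string $words): string
{
    $parts = array_filter(explode(' ', $words), static fn (string $w): bool => $w !== '');

    return implode('', array_map('ucfirst', $parts));
}

/** Append $suffix when the label collides (case-insensitively) with a reserved word. */
function suffixReserved(string $label, string $suffix): string
{
    return in_array(strtolower($label), RESERVED_WORDS, true) ? $label . $suffix : $label;
}

echo suffixReserved(toPascalCase(spellOutLeadingDigits('123foo')), 'Controller'), "\n"; // OneTwoThreeFoo
echo suffixReserved(toPascalCase(spellOutLeadingDigits('class')), 'Controller'), "\n";  // ClassController
```

Keeping the intermediate result as space-separated words, and only joining them in the casing step, is the same idea this commit applies to the real normalizers.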
6 changes: 6 additions & 0 deletions phpunit.xml.dist
@@ -23,4 +23,10 @@
             <directory suffix=".php">src</directory>
         </include>
     </coverage>
+
+    <php>
+        <!-- Seems to be needed by CI's PHP8.2-RC1? Not needed in PHP8.2-dev locally! -->
+        <ini name="assert.exception" value="1" />
+        <ini name="assert.warning" value="0" />
+    </php>
 </phpunit>
152 changes: 82 additions & 70 deletions src/AbstractNormalizer.php
@@ -9,10 +9,11 @@
 use IntlCodePointBreakIterator;
 use Transliterator;
 
+use function array_filter;
 use function array_map;
 use function array_shift;
+use function array_slice;
 use function assert;
-use function count;
 use function explode;
 use function implode;
 use function in_array;
@@ -27,7 +28,6 @@
 use function strtolower;
 use function substr;
 use function trim;
-use function ucfirst;
 
 /**
  * Utility for generating valid PHP labels from UTF-8 strings
@@ -132,67 +132,66 @@ abstract class AbstractNormalizer implements NormalizerInterface
     ];
 
     private const ASCII_SPELLOUT = [
-        1 => 'StartOfHeader',
-        2 => 'StartOfText',
-        3 => 'EndOfText',
-        4 => 'EndOfTransmission',
+        1 => 'Start Of Header',
+        2 => 'Start Of Text',
+        3 => 'End Of Text',
+        4 => 'End Of Transmission',
         5 => 'Enquiry',
         6 => 'Acknowledgement',
         7 => 'Bell',
         8 => 'Backspace',
-        9 => 'HorizontalTab',
-        10 => 'LineFeed',
-        11 => 'VerticalTab',
-        12 => 'FormFeed',
-        13 => 'CarriageReturn',
-        14 => 'ShiftOut',
-        15 => 'ShiftIn',
-        16 => 'DataLinkEscape',
-        17 => 'DeviceControlOne',
-        18 => 'DeviceControlTwo',
-        19 => 'DeviceControlThree',
-        20 => 'DeviceControlFour',
-        21 => 'NegativeAcknowledgement',
-        22 => 'SynchronousIdle',
-        23 => 'EndOfTransmissionBlock',
+        9 => 'Horizontal Tab',
+        10 => 'Line Feed',
+        11 => 'Vertical Tab',
+        12 => 'Form Feed',
+        13 => 'Carriage Return',
+        14 => 'Shift Out',
+        15 => 'Shift In',
+        16 => 'Data Link Escape',
+        17 => 'Device Control One',
+        18 => 'Device Control Two',
+        19 => 'Device Control Three',
+        20 => 'Device Control Four',
+        21 => 'Negative Acknowledgement',
+        22 => 'Synchronous Idle',
+        23 => 'End Of Transmission Block',
         24 => 'Cancel',
-        25 => 'EndOfMedium',
+        25 => 'End Of Medium',
         26 => 'Substitute',
         27 => 'Escape',
-        28 => 'FileSeparator',
-        29 => 'GroupSeparator',
-        30 => 'RecordSeparator',
-        31 => 'UnitSeparator',
-        32 => 'Space',
+        28 => 'File Separator',
+        29 => 'Group Separator',
+        30 => 'Record Separator',
+        31 => 'Unit Separator',
         33 => 'Exclamation',
-        34 => 'DoubleQuote',
+        34 => 'Double Quote',
         35 => 'Number',
         36 => 'Dollar',
         37 => 'Percent',
         38 => 'Ampersand',
         39 => 'Quote',
-        40 => 'OpenBracket',
-        41 => 'CloseBracket',
+        40 => 'Open Bracket',
+        41 => 'Close Bracket',
         42 => 'Asterisk',
         43 => 'Plus',
         44 => 'Comma',
-        46 => 'FullStop',
+        46 => 'Full Stop',
         47 => 'Slash',
         58 => 'Colon',
         59 => 'Semicolon',
-        60 => 'LessThan',
+        60 => 'Less Than',
         61 => 'Equals',
-        62 => 'GreaterThan',
-        63 => 'QuestionMark',
+        62 => 'Greater Than',
+        63 => 'Question Mark',
         64 => 'At',
-        91 => 'OpenSquare',
+        91 => 'Open Square',
         92 => 'Backslash',
-        93 => 'CloseSquare',
+        93 => 'Close Square',
         94 => 'Caret',
         96 => 'Backtick',
-        123 => 'OpenCurly',
+        123 => 'Open Curly',
         124 => 'Pipe',
-        125 => 'CloseCurly',
+        125 => 'Close Curly',
         126 => 'Tilde',
         127 => 'Delete',
     ];
@@ -252,30 +251,36 @@ protected function toAscii(string $string): string
         return $this->spellOutNonAscii(implode(' ', $words));
     }
 
-    protected function separatorsToUnderscore(string $string): string
+    protected function separatorsToSpace(string $string): string
     {
-        return preg_replace('/[' . $this->separators . '\s]+/', '_', trim($string));
+        return preg_replace('/[' . $this->separators . '\s_]+/', ' ', trim($string));
     }
 
     protected function spellOutAscii(string $string): string
     {
-        $chunks = str_split($string);
-        $last   = count($chunks) - 1;
-        foreach (str_split($string) as $i => $char) {
-            if (isset(self::ASCII_SPELLOUT[ord($char)])) {
-                $char = self::ASCII_SPELLOUT[ord($char)] . ($i < $last ? '_' : '');
+        $speltOut = [];
+        $current  = '';
+
+        foreach (str_split($string) as $char) {
+            $ord = ord($char);
+            if (! isset(self::ASCII_SPELLOUT[$ord])) {
+                $current .= $char;
+                continue;
             }
-            $chunks[$i] = $char;
+
+            $speltOut[] = $current;
+            $speltOut[] = self::ASCII_SPELLOUT[$ord];
+            $current    = '';
         }
+        $speltOut[] = $current;
 
-        return $this->spellOutLeadingDigits(implode('', $chunks));
+        return $this->spellOutLeadingDigits(implode(' ', $speltOut));
     }
 
     protected function toCase(string $string): string
     {
-        assert(in_array($this->case, self::VALID_CASES));
-
-        $parts = explode('_', $string);
+        /** @var list<string> $parts */
+        $parts = array_filter(explode(' ', $string));
         return match ($this->case) {
             self::CAMEL_CASE => $this->toCamelCase($parts),
             self::PASCAL_CASE => $this->toPascalCase($parts),
@@ -284,11 +289,11 @@ protected function toCase(string $string): string
         };
     }
 
-    protected function sanitizeReserved(string $string, array $reserved): string
+    protected function sanitizeReserved(string $string): string
     {
         assert($this->suffix !== null);
 
-        if (in_array(strtolower($string), $reserved, true)) {
+        if (in_array(strtolower($string), self::RESERVED, true)) {
             return $string . $this->suffix;
         }
         return $string;
@@ -297,10 +302,10 @@ protected function sanitizeReserved(string $string, array $reserved): string
     private function prepareSuffix(string|null $suffix, string $case): string|null
     {
         if ($suffix === null) {
-            return $suffix;
+            return null;
         }
 
-        if ($suffix === '' || ! preg_match('/^[a-zA-Z0-9_\x80-\xff]*$/', $suffix)) {
+        if (! preg_match('/^[a-zA-Z0-9_\x80-\xff]+$/', $suffix)) {
             throw NormalizerException::invalidSuffix($suffix);
         }
 
@@ -312,46 +317,53 @@ private function prepareSuffix(string|null $suffix, string $case): string|null
 
     private function spellOutNonAscii(string $string): string
     {
-        $speltOut = '';
+        $speltOut = [];
+        $current  = '';
+
         $this->codePoints->setText($string);
         /** @var string $char */
         foreach ($this->codePoints->getPartsIterator() as $char) {
-            $ord       = IntlChar::ord($char);
-            $speltOut .= $ord < 256 ? $char : $this->spellOutNonAsciiChar($ord);
+            $ord = IntlChar::ord($char);
+            if ($ord < 256) {
+                $current .= $char;
+                continue;
+            }
+
+            $speltOut[] = $current;
+            $speltOut[] = $this->spellOutNonAsciiChar($ord);
+            $current    = '';
         }
+        $speltOut[] = $current;
 
-        return $speltOut;
+        return implode(' ', $speltOut);
     }
 
     private function spellOutNonAsciiChar(int $ord): string
     {
         $speltOut = IntlChar::charName($ord);
 
-        // 'EURO SIGN' -> 'Euro'
-        return implode('', array_map(function (string $part): string {
-            return $part === 'SIGN' ? '' : ucfirst(strtolower($part));
-        }, explode(" ", $speltOut)));
+        // 'EURO SIGN' -> 'euro'
+        return implode(' ', array_map(function (string $part): string {
+            return $part === 'SIGN' ? '' : strtolower($part);
+        }, explode(' ', $speltOut)));
     }
 
     private function spellOutLeadingDigits(string $string): string
     {
-        $chunks = str_split($string);
+        $speltOut = [];
+        $chunks   = str_split($string);
         foreach ($chunks as $i => $char) {
-            if ($i > 1 && $char === '_') {
-                $chunks[$i] = '';
-                break;
-            }
-
             $ord = ord($char);
 
             if (! isset(self::DIGIT_SPELLOUT[$ord])) {
+                $speltOut[] = implode('', array_slice($chunks, $i));
                 break;
             }
 
-            $chunks[$i] = self::DIGIT_SPELLOUT[$ord] . '_';
+            $speltOut[] = self::DIGIT_SPELLOUT[$ord];
         }
 
-        return implode('', $chunks);
+        return implode(' ', $speltOut);
     }
 
     /**
10 changes: 5 additions & 5 deletions src/ClassNameNormalizer.php
@@ -36,11 +36,11 @@ public function normalize(string $label): string
 
     private function normalizeLabel(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
-        $cased       = $this->toCase($speltOut);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
+        $cased    = $this->toCase($speltOut);
 
-        return $this->sanitizeReserved($cased, self::RESERVED);
+        return $this->sanitizeReserved($cased);
     }
 }
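
Reviewer note: the reason for switching the intermediate separator from `_` to a space is easiest to see with snake_case output. Below is a minimal standalone sketch of the new `separatorsToSpace()` → `spellOutAscii()` → `toCase()` order, assuming a three-entry subset of `ASCII_SPELLOUT`, a hypothetical `toSnakeCase()` helper, and `-` as the only configured separator. `toAscii()` is skipped because it needs `ext-intl`, and `preg_quote()` is an addition of this sketch rather than something the diff does.

```php
<?php

declare(strict_types=1);

// Illustrative subset of the ASCII_SPELLOUT table from src/AbstractNormalizer.php.
const SPELLOUT = [36 => 'Dollar', 46 => 'Full Stop', 64 => 'At'];

/** Collapse runs of separators, whitespace, and underscores into single spaces. */
function separatorsToSpace(string $string, string $separators = '-'): string
{
    // preg_quote() is this sketch's own precaution, not part of the commit.
    return (string) preg_replace('/[' . preg_quote($separators, '/') . '\s_]+/', ' ', trim($string));
}

/** Split the string around spelt-out characters so every spell-out becomes its own word(s). */
function spellOutAscii(string $string): string
{
    $words   = [];
    $current = '';
    foreach (str_split($string) as $char) {
        $ord = ord($char);
        if (! isset(SPELLOUT[$ord])) {
            $current .= $char;
            continue;
        }
        $words[] = $current;
        $words[] = SPELLOUT[$ord];
        $current = '';
    }
    $words[] = $current;

    return implode(' ', $words);
}

/** Hypothetical casing step: lower-case each non-empty word and join with underscores. */
function toSnakeCase(string $string): string
{
    $parts = array_filter(explode(' ', $string), static fn (string $p): bool => $p !== '');

    return implode('_', array_map('strtolower', $parts));
}

echo toSnakeCase(spellOutAscii(separatorsToSpace('foo$bar-baz'))), "\n"; // foo_dollar_bar_baz
echo toSnakeCase(spellOutAscii(separatorsToSpace('a.b'))), "\n";         // a_full_stop_b
```

With the old `'FullStop'` spell-out, the casing step would have seen a single word and produced `a_fullstop_b`; storing `'Full Stop'` and splitting on spaces keeps the word boundary.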
10 changes: 5 additions & 5 deletions src/ConstantNameNormalizer.php
@@ -22,11 +22,11 @@ public function __construct(
      */
     public function normalize(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
-        $cased       = $this->toCase($speltOut);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
+        $cased    = $this->toCase($speltOut);
 
-        return $this->sanitizeReserved($cased, self::RESERVED);
+        return $this->sanitizeReserved($cased);
     }
 }
6 changes: 3 additions & 3 deletions src/PropertyNameNormalizer.php
@@ -21,9 +21,9 @@ public function __construct(string $case = self::CAMEL_CASE, string $separators
      */
     public function normalize(string $label): string
     {
-        $ascii       = $this->toAscii($label);
-        $underscored = $this->separatorsToUnderscore($ascii);
-        $speltOut    = $this->spellOutAscii($underscored);
+        $ascii    = $this->toAscii($label);
+        $spaced   = $this->separatorsToSpace($ascii);
+        $speltOut = $this->spellOutAscii($spaced);
 
         return $this->toCase($speltOut);
     }