Fix word separation with snake_case #1

Merged
merged 4 commits into from
Oct 9, 2022
39 changes: 25 additions & 14 deletions README.md
@@ -7,18 +7,18 @@ Utilities for generating PHP code.

## Normalizers

The normalizers generate PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings,
The normalizers generate readable PHP labels (class names, namespaces, property names, etc) from valid UTF-8 strings,
[transliterating] them to ASCII and spelling out any invalid characters.

### Usage:

The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Shop"):
The following code (forgive the Japanese - a certain translation tool tells me it means "Pet Store"):
```php
<?php

use Kynx\CodeUtils\ClassNameNormalizer;

$normalizer = new ClassNameNormalizer();
$normalizer = new ClassNameNormalizer('Controller');
$namespace = $normalizer->normalize('ペット \ ショップ');
echo $namespace;
```
@@ -48,23 +48,34 @@ See the [tests] for more examples.

### Why?

You should never generate code from untrusted user input. But there are a few cases where you may want to do it with
mostly-trusted input. In my case, it's generating classes and properties from an OpenAPI specification, where there are
no restrictions on the characters present.
You must **never** run code generated from untrusted user input. But there are a few cases where you do want to
_output_ code generated from (mostly) trusted input.

In my case, I need to generate classes and properties from an OpenAPI specification. There are no hard-and-fast rules
on the characters present, just a vague "it is RECOMMENDED to follow common programming naming conventions". Whatever
they are.

### How?

`AbstractNormalizer` uses `ext-intl`'s [Transliterator] to perform the transliteration. Where a character has no
Each normalizer uses `ext-intl`'s [Transliterator] to turn the UTF-8 string into Latin-ASCII. Where a character has no
equivalent in ASCII (the "€" symbol is a good example), it uses the [Unicode name] of the character to spell it out (to
`Euro`). For ASCII characters that are not valid in a PHP label, it provides it's own spell outs: for instance, a
backtick "`" becomes `Backtick`.
`Euro`, after some minor clean-up). For ASCII characters that are not valid in a PHP label, it provides its own spell
outs. For instance, a backtick "&#96;" becomes `Backtick`.

Initial digits are also spelt out: "123foo" becomes `OneTwoThreeFoo`. Finally reserved words are suffixed with a
user-supplied string so they don't mess things up. In the first usage example above, if we normalized "class" it would
become `ClassController`.

The results may not be pretty. If for some mad reason your input contains ` ͖` - put your glasses on! - the label will
contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a chance of being as unique as
the original. But speaking of which...

Initial digits are also spelt out - "123 foo" becomes `OneTwoThreeFoo` - and finally reserved words are suffixed with a
user-supplied string so they don't mess things up: "class" can become `ClassController`.
### Uniqueness

The results may not be pretty. For instance, if your input contains ` ͖` - put your glasses on! - the class name will
contain `CombiningRightArrowheadAndUpArrowheadBelow`. But it _is_ valid PHP, and stands a good chance of being as unique
as the original.
The normalization process reduces around a million Unicode code points down to just 162 ASCII characters. Then we mangle
it further by stripping separators, reducing whitespace and turning it into camelCase, snake_case or whatever
your programming preference. It's gonna be lossy - nothing we can do about that. Ideally this library would provide a
utility for guaranteeing uniqueness across a set of labels, but I haven't written it yet. Feel free to contribute!
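
As a rough idea of what such a uniqueness utility could look like, here is a minimal sketch. The `makeUnique()` function is hypothetical and not part of this library. It assumes the labels have already been normalized, and that appending a counter is an acceptable way to separate clashes.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): makes a list of already
 * normalized labels unique by appending a counter to repeats.
 *
 * @param list<string> $labels
 * @return list<string>
 */
function makeUnique(array $labels): array
{
    $seen   = [];
    $unique = [];

    foreach ($labels as $label) {
        $candidate = $label;
        $counter   = 1;

        // PHP class names are case-insensitive, so compare lower-cased
        while (isset($seen[strtolower($candidate)])) {
            $counter++;
            $candidate = $label . $counter;
        }

        $seen[strtolower($candidate)] = true;
        $unique[]                     = $candidate;
    }

    return $unique;
}
```

Calling `makeUnique(['PetShop', 'Petshop', 'Cat', 'PetShop'])` would return `['PetShop', 'Petshop2', 'Cat', 'PetShop3']`. A real implementation would probably also need to re-check the result against the reserved words.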


[transliterating]: https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
6 changes: 6 additions & 0 deletions phpunit.xml.dist
@@ -23,4 +23,10 @@
<directory suffix=".php">src</directory>
</include>
</coverage>

<php>
<!-- Seems to be needed by CI's PHP8.2-RC1? Not needed in PHP8.2-dev locally! -->
<ini name="assert.exception" value="1" />
<ini name="assert.warning" value="0" />
</php>
</phpunit>
152 changes: 82 additions & 70 deletions src/AbstractNormalizer.php
@@ -9,10 +9,11 @@
use IntlCodePointBreakIterator;
use Transliterator;

use function array_filter;
use function array_map;
use function array_shift;
use function array_slice;
use function assert;
use function count;
use function explode;
use function implode;
use function in_array;
@@ -27,7 +28,6 @@
use function strtolower;
use function substr;
use function trim;
use function ucfirst;

/**
* Utility for generating valid PHP labels from UTF-8 strings
@@ -132,67 +132,66 @@ abstract class AbstractNormalizer implements NormalizerInterface
];

private const ASCII_SPELLOUT = [
1 => 'StartOfHeader',
2 => 'StartOfText',
3 => 'EndOfText',
4 => 'EndOfTransmission',
1 => 'Start Of Header',
2 => 'Start Of Text',
3 => 'End Of Text',
4 => 'End Of Transmission',
5 => 'Enquiry',
6 => 'Acknowledgement',
7 => 'Bell',
8 => 'Backspace',
9 => 'HorizontalTab',
10 => 'LineFeed',
11 => 'VerticalTab',
12 => 'FormFeed',
13 => 'CarriageReturn',
14 => 'ShiftOut',
15 => 'ShiftIn',
16 => 'DataLinkEscape',
17 => 'DeviceControlOne',
18 => 'DeviceControlTwo',
19 => 'DeviceControlThree',
20 => 'DeviceControlFour',
21 => 'NegativeAcknowledgement',
22 => 'SynchronousIdle',
23 => 'EndOfTransmissionBlock',
9 => 'Horizontal Tab',
10 => 'Line Feed',
11 => 'Vertical Tab',
12 => 'Form Feed',
13 => 'Carriage Return',
14 => 'Shift Out',
15 => 'Shift In',
16 => 'Data Link Escape',
17 => 'Device Control One',
18 => 'Device Control Two',
19 => 'Device Control Three',
20 => 'Device Control Four',
21 => 'Negative Acknowledgement',
22 => 'Synchronous Idle',
23 => 'End Of Transmission Block',
24 => 'Cancel',
25 => 'EndOfMedium',
25 => 'End Of Medium',
26 => 'Substitute',
27 => 'Escape',
28 => 'FileSeparator',
29 => 'GroupSeparator',
30 => 'RecordSeparator',
31 => 'UnitSeparator',
32 => 'Space',
28 => 'File Separator',
29 => 'Group Separator',
30 => 'Record Separator',
31 => 'Unit Separator',
33 => 'Exclamation',
34 => 'DoubleQuote',
34 => 'Double Quote',
35 => 'Number',
36 => 'Dollar',
37 => 'Percent',
38 => 'Ampersand',
39 => 'Quote',
40 => 'OpenBracket',
41 => 'CloseBracket',
40 => 'Open Bracket',
41 => 'Close Bracket',
42 => 'Asterisk',
43 => 'Plus',
44 => 'Comma',
46 => 'FullStop',
46 => 'Full Stop',
47 => 'Slash',
58 => 'Colon',
59 => 'Semicolon',
60 => 'LessThan',
60 => 'Less Than',
61 => 'Equals',
62 => 'GreaterThan',
63 => 'QuestionMark',
62 => 'Greater Than',
63 => 'Question Mark',
64 => 'At',
91 => 'OpenSquare',
91 => 'Open Square',
92 => 'Backslash',
93 => 'CloseSquare',
93 => 'Close Square',
94 => 'Caret',
96 => 'Backtick',
123 => 'OpenCurly',
123 => 'Open Curly',
124 => 'Pipe',
125 => 'CloseCurly',
125 => 'Close Curly',
126 => 'Tilde',
127 => 'Delete',
];
@@ -252,30 +251,36 @@ protected function toAscii(string $string): string
return $this->spellOutNonAscii(implode(' ', $words));
}

protected function separatorsToUnderscore(string $string): string
protected function separatorsToSpace(string $string): string
{
return preg_replace('/[' . $this->separators . '\s]+/', '_', trim($string));
return preg_replace('/[' . $this->separators . '\s_]+/', ' ', trim($string));
}

protected function spellOutAscii(string $string): string
{
$chunks = str_split($string);
$last = count($chunks) - 1;
foreach (str_split($string) as $i => $char) {
if (isset(self::ASCII_SPELLOUT[ord($char)])) {
$char = self::ASCII_SPELLOUT[ord($char)] . ($i < $last ? '_' : '');
$speltOut = [];
$current = '';

foreach (str_split($string) as $char) {
$ord = ord($char);
if (! isset(self::ASCII_SPELLOUT[$ord])) {
$current .= $char;
continue;
}
$chunks[$i] = $char;

$speltOut[] = $current;
$speltOut[] = self::ASCII_SPELLOUT[$ord];
$current = '';
}
$speltOut[] = $current;

return $this->spellOutLeadingDigits(implode('', $chunks));
return $this->spellOutLeadingDigits(implode(' ', $speltOut));
}
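
The chunking approach in the new `spellOutAscii()` can be tried in isolation with a sketch like the one below. It is a standalone approximation, not the library code: `spellOutSample()` and its three-entry map are made up for illustration, and it skips the call to `spellOutLeadingDigits()`.

```php
<?php

declare(strict_types=1);

/**
 * Standalone approximation of the chunking loop. The function name and
 * the tiny spell-out map are illustrative only.
 */
function spellOutSample(string $string): string
{
    // Small subset of a spell-out table, keyed by ASCII code point
    $map = [
        36 => 'Dollar',
        46 => 'Full Stop',
        96 => 'Backtick',
    ];

    $speltOut = [];
    $current  = '';

    foreach (str_split($string) as $char) {
        $ord = ord($char);
        if (! isset($map[$ord])) {
            // Ordinary character: keep building the current chunk
            $current .= $char;
            continue;
        }

        // Flush the chunk so far, then add the spelt-out name as its own chunk
        $speltOut[] = $current;
        $speltOut[] = $map[$ord];
        $current    = '';
    }
    $speltOut[] = $current;

    // Chunks are joined with spaces; empty chunks leave extra spaces behind
    return implode(' ', $speltOut);
}
```

With this sketch `spellOutSample('foo.bar')` gives `'foo Full Stop bar'`, while `spellOutSample('$foo')` gives `' Dollar foo'` with a leading space. Those stray spaces seem to be why the case conversion step filters out empty parts after splitting.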

protected function toCase(string $string): string
{
assert(in_array($this->case, self::VALID_CASES));

$parts = explode('_', $string);
/** @var list<string> $parts */
$parts = array_filter(explode(' ', $string));
return match ($this->case) {
self::CAMEL_CASE => $this->toCamelCase($parts),
self::PASCAL_CASE => $this->toPascalCase($parts),
@@ -284,11 +289,11 @@ protected function toCase(string $string): string
};
}
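
To see why a space works as the internal word separator for every case style, including snake_case, here is a hedged sketch of the split-and-join step. The `toCaseSketch()` function and its `'camel'`, `'pascal'` and `'snake'` keys are assumptions made for this example. The real `toCamelCase()` and `toPascalCase()` helpers are not shown in this diff and may behave differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the case conversion step.
 */
function toCaseSketch(string $string, string $case): string
{
    // Drop empty chunks left by leading, trailing or doubled spaces.
    // An explicit callback is used so that a chunk of "0" is kept;
    // array_filter() without a callback would discard it as falsy.
    $parts = array_values(array_filter(
        explode(' ', $string),
        fn (string $part): bool => $part !== ''
    ));

    $lower  = array_map('strtolower', $parts);
    $pascal = implode('', array_map('ucfirst', $lower));

    return match ($case) {
        'pascal' => $pascal,
        'camel'  => lcfirst($pascal),
        'snake'  => implode('_', $lower),
        default  => throw new InvalidArgumentException("Unknown case '$case'"),
    };
}
```

For the input `' Dollar foo  bar'` this returns `'DollarFooBar'`, `'dollarFooBar'` and `'dollar_foo_bar'` respectively. Because underscores are only introduced at the very end, they can no longer be confused with the separator between words.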

protected function sanitizeReserved(string $string, array $reserved): string
protected function sanitizeReserved(string $string): string
{
assert($this->suffix !== null);

if (in_array(strtolower($string), $reserved, true)) {
if (in_array(strtolower($string), self::RESERVED, true)) {
return $string . $this->suffix;
}
return $string;
@@ -297,10 +302,10 @@ protected function sanitizeReserved(string $string, array $reserved): string
private function prepareSuffix(string|null $suffix, string $case): string|null
{
if ($suffix === null) {
return $suffix;
return null;
}

if ($suffix === '' || ! preg_match('/^[a-zA-Z0-9_\x80-\xff]*$/', $suffix)) {
if (! preg_match('/^[a-zA-Z0-9_\x80-\xff]+$/', $suffix)) {
throw NormalizerException::invalidSuffix($suffix);
}

@@ -312,46 +317,53 @@ private function prepareSuffix(string|null $suffix, string $case): string|null

private function spellOutNonAscii(string $string): string
{
$speltOut = '';
$speltOut = [];
$current = '';

$this->codePoints->setText($string);
/** @var string $char */
foreach ($this->codePoints->getPartsIterator() as $char) {
$ord = IntlChar::ord($char);
$speltOut .= $ord < 256 ? $char : $this->spellOutNonAsciiChar($ord);
$ord = IntlChar::ord($char);
if ($ord < 256) {
$current .= $char;
continue;
}

$speltOut[] = $current;
$speltOut[] = $this->spellOutNonAsciiChar($ord);
$current = '';
}
$speltOut[] = $current;

return $speltOut;
return implode(' ', $speltOut);
}

private function spellOutNonAsciiChar(int $ord): string
{
$speltOut = IntlChar::charName($ord);

// 'EURO SIGN' -> 'Euro'
return implode('', array_map(function (string $part): string {
return $part === 'SIGN' ? '' : ucfirst(strtolower($part));
}, explode(" ", $speltOut)));
// 'EURO SIGN' -> 'euro'
return implode(' ', array_map(function (string $part): string {
return $part === 'SIGN' ? '' : strtolower($part);
}, explode(' ', $speltOut)));
}

private function spellOutLeadingDigits(string $string): string
{
$chunks = str_split($string);
$speltOut = [];
$chunks = str_split($string);
foreach ($chunks as $i => $char) {
if ($i > 1 && $char === '_') {
$chunks[$i] = '';
break;
}

$ord = ord($char);

if (! isset(self::DIGIT_SPELLOUT[$ord])) {
$speltOut[] = implode('', array_slice($chunks, $i));
break;
}

$chunks[$i] = self::DIGIT_SPELLOUT[$ord] . '_';
$speltOut[] = self::DIGIT_SPELLOUT[$ord];
}

return implode('', $chunks);
return implode(' ', $speltOut);
}
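
The reworked leading-digit logic can be exercised on its own with the sketch below. It mirrors the new loop but is not the library code: `spellOutLeadingDigitsSketch()` is a made-up name, and the digit map stands in for `DIGIT_SPELLOUT`, whose actual values are not shown in this diff.

```php
<?php

declare(strict_types=1);

/**
 * Standalone approximation of the leading-digit spell-out.
 */
function spellOutLeadingDigitsSketch(string $string): string
{
    // Assumed contents: digit names keyed by ASCII code point ("0" is 48)
    $digits = [
        48 => 'Zero', 49 => 'One', 50 => 'Two', 51 => 'Three', 52 => 'Four',
        53 => 'Five', 54 => 'Six', 55 => 'Seven', 56 => 'Eight', 57 => 'Nine',
    ];

    $speltOut = [];
    $chunks   = str_split($string);

    foreach ($chunks as $i => $char) {
        $ord = ord($char);

        if (! isset($digits[$ord])) {
            // First non-digit: keep the rest of the string untouched and stop
            $speltOut[] = implode('', array_slice($chunks, $i));
            break;
        }

        $speltOut[] = $digits[$ord];
    }

    return implode(' ', $speltOut);
}
```

Here `spellOutLeadingDigitsSketch('123foo')` returns `'One Two Three foo'`, which a PascalCase conversion would then turn into `OneTwoThreeFoo`. Digits that are not at the start, as in `'foo123'`, are left alone.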

/**
10 changes: 5 additions & 5 deletions src/ClassNameNormalizer.php
@@ -36,11 +36,11 @@ public function normalize(string $label): string

private function normalizeLabel(string $label): string
{
$ascii = $this->toAscii($label);
$underscored = $this->separatorsToUnderscore($ascii);
$speltOut = $this->spellOutAscii($underscored);
$cased = $this->toCase($speltOut);
$ascii = $this->toAscii($label);
$spaced = $this->separatorsToSpace($ascii);
$speltOut = $this->spellOutAscii($spaced);
$cased = $this->toCase($speltOut);

return $this->sanitizeReserved($cased, self::RESERVED);
return $this->sanitizeReserved($cased);
}
}
10 changes: 5 additions & 5 deletions src/ConstantNameNormalizer.php
@@ -22,11 +22,11 @@ public function __construct(
*/
public function normalize(string $label): string
{
$ascii = $this->toAscii($label);
$underscored = $this->separatorsToUnderscore($ascii);
$speltOut = $this->spellOutAscii($underscored);
$cased = $this->toCase($speltOut);
$ascii = $this->toAscii($label);
$spaced = $this->separatorsToSpace($ascii);
$speltOut = $this->spellOutAscii($spaced);
$cased = $this->toCase($speltOut);

return $this->sanitizeReserved($cased, self::RESERVED);
return $this->sanitizeReserved($cased);
}
}
6 changes: 3 additions & 3 deletions src/PropertyNameNormalizer.php
@@ -21,9 +21,9 @@ public function __construct(string $case = self::CAMEL_CASE, string $separators
*/
public function normalize(string $label): string
{
$ascii = $this->toAscii($label);
$underscored = $this->separatorsToUnderscore($ascii);
$speltOut = $this->spellOutAscii($underscored);
$ascii = $this->toAscii($label);
$spaced = $this->separatorsToSpace($ascii);
$speltOut = $this->spellOutAscii($spaced);

return $this->toCase($speltOut);
}