Skip to content

[Validator] Add NoBannedWords constraint #49868

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Spomky
Copy link
Contributor

@Spomky Spomky commented Mar 30, 2023

Q A
Branch? 6.3
Bug fix? no
New feature? yes
Deprecations? no
Tickets none->
License MIT
Doc PR symfony/symfony-docs#[TODO]

This PR adds a new constraint NoBannedWords that can detect forbidden words in a text against a dictionary

@Spomky Spomky force-pushed the features/validator-no-banned-words branch from d7fdc10 to 41af25c Compare March 30, 2023 19:38
@Spomky
Copy link
Contributor Author

Spomky commented Mar 30, 2023

@nicolas-grekas, @stof: many thanks for the code review. All comments have been addressed 👍🏽.

@Spomky Spomky force-pushed the features/validator-no-banned-words branch from 41af25c to b40df64 Compare March 30, 2023 19:40
@Spomky Spomky marked this pull request as draft March 30, 2023 21:12
/**
* @param array<string> $dictionary
*/
public function __construct(array $dictionary = [], mixed $options = null, array $groups = null, mixed $payload = null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should keep $options as the first argument. Otherwise, this will break the XML and YAML loaders, which only pass a single argument with all options. Named arguments for all available options must come after it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks.

'n' => '(n|\^/|\|V|\|\\\||/\\/|\[\\\]|\<\\\>|\{\\\}|\]\\\[|//|\^|\[\]|/V|₪)',
'o' => '(o|0|\(\)|oh|\[\]|¤|°|\(\[\]\))',
'p' => '(p|\|\*|\|o|\|º|\|\^\(o\)|\|\>|\|"|9|\[\]D|\|̊|\|7|\?|/\*|¶|\|D)',
'q' => '(q|\(_,\)|\(\)_|0_|°\||\<\||0\.)',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All those groups should be non-capturing, to avoid reaching the limit on the number of capturing groups in PCRE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the brackets should just be removed actually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the behaviour is not the same with [] instead of ().
As an example, "$yMph0NY" is not detected because of "ph". Same if I change "symfony" into "symfohny". It looks like multicharacters are not understood when grouping with []

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked to use non-capturing groups, not character classes

throw new UnexpectedValueException($value, 'string');
}

$toL33tRegex = fn (string $data): string => implode('', array_map(fn (string $char): string => strtr($char, self::LEET_MAP), str_split($data)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about all characters that are not keys in self::LEET_MAP (which include uppercase letters btw, as you don't enforce lowercase in the dictionary) ? Having a . in the dictionary would lead to pretty bad things (it gets even worse when you consider that anything else than a lowercase ASCII letter is injected as is in the regex, allowing to inject all regex special chars in it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For uppercase, I use the modifier i. You are right, the dictionary shall only contain lowercase alpha otherwize it does not work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strtr does not care that you use the i modifier in the final regex: it won't turn a A in the dictionary word into l33t equivalents

You are right, the dictionary shall only contain lowercase alpha otherwize it does not work.

This must be enforced, because it opens the door to bad behavior otherwise (and it is actually a nasty requirement as it would also add that requirement for the non-l33t banned words, which don't have this restriction otherwise)

@Spomky
Copy link
Contributor Author

Spomky commented Mar 31, 2023

This constraint is not mature enough. There are many use cases where it doesn't work as expected.
Moreover, it is limited to words written with [a-z] and is therefore not designed for other scripts (Greek, Japanese, Chinese, Cyrillic...).
I prefer to stop and think about it in the future.
Anyway, thank you very much for the review and feedback so far.

@Spomky Spomky closed this Mar 31, 2023
@Spomky Spomky deleted the features/validator-no-banned-words branch March 31, 2023 20:02
@stof
Copy link
Member

stof commented Mar 31, 2023

I think the whole part being immature is the attempt at forbidding l33t variants of the banned words. Without that, the constraint would be fine.

@Spomky
Copy link
Contributor Author

Spomky commented Apr 1, 2023

Thanks @stof. I think you are right. I revised my mind and simplified the l33t part.

Edit: I cannot reopen, I will recreate a new PR in the next days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants