What is regex?
Regex, or regular expressions, are special sequences used to find or match patterns in strings. These sequences use metacharacters and other syntax to represent sets, ranges, or specific characters. For example, the expression [0-9] matches any single digit from 0 to 9.
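As a first illustration, here is a minimal sketch using Python's built-in re module. The sample text is made up for this example; the point is that [0-9] matches one digit at a time.

```python
import re

text = "Room 4 is on floor 12"

# [0-9] matches exactly one digit per match, so "12" produces two matches.
digits = re.findall(r"[0-9]", text)
print(digits)  # ['4', '1', '2']
```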
Regular Expression Language - Quick Reference
Some terminology:
- pattern: regular expression pattern
- string: test string used to match the pattern
- digit: 0-9
- letter: a-z, A-Z
- symbol: !$%^&*()_+|~-=`{}[]:";'<>?,./
- space: single white space, tab
- character: refers to a letter, digit or symbol
- Partial range: selections such as [a-f] or [g-p].
- Capitalized range: [A-Z].
- Digit range: [0-9].
- Symbol range: for example, [#$%&@].
- Mixed range: for example, [a-zA-Z0-9] includes all digits and all lower and upper case letters.
Do note that a range only specifies multiple alternatives for a single character in a pattern. To further understand how to define a range, it’s best to look at the full ASCII table in order to see how characters are ordered.
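The range kinds listed above can be tried out with a short sketch like the following. The sample string is invented for illustration, and Python's re module is assumed; other engines treat these basic ranges the same way.

```python
import re

sample = "Gate B7 @ dock"

partial = re.findall(r"[a-f]", sample)            # partial lowercase range
capitals = re.findall(r"[A-Z]", sample)           # capitalized range
digits = re.findall(r"[0-9]", sample)             # digit range
symbols = re.findall(r"[#$%&@]", sample)          # symbol set
mixed = "".join(re.findall(r"[a-zA-Z0-9]", sample))  # mixed range

print(partial)   # ['a', 'e', 'd', 'c']
print(capitals)  # ['G', 'B']
print(digits)    # ['7']
print(symbols)   # ['@']
print(mixed)     # GateB7dock

# Ranges follow character-code order, which is why A-Z and a-z are separate ranges.
codes = (ord("A"), ord("Z"), ord("a"), ord("z"))
print(codes)     # (65, 90, 97, 122)
```

Each match is a single character, which reflects the note that a range only offers alternatives for one position in the pattern.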
Some helpful basics:
Square Brackets ([]):
The name might sound scary, but it is nothing but the symbol: []. Square brackets define what is called a character class, regex jargon for a set that matches any one of the characters listed inside the brackets. For instance:
Pattern | Matches
[Pp]enguin | Penguin, penguin
[0123456789] | any single digit
[0oO] | 0, o, O
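A small sketch of these character classes, assuming Python's re module and made-up input strings:

```python
import re

# One class, one character: [Pp] accepts either capitalization of the first letter.
penguins = re.findall(r"[Pp]enguin", "A Penguin met a penguin")
print(penguins)  # ['Penguin', 'penguin']

# Listing every digit is equivalent to the range [0-9].
digits = re.findall(r"[0123456789]", "a1b22")
print(digits)    # ['1', '2', '2']

# Zero, lowercase o, and uppercase O are all accepted; other characters are skipped.
zeros = re.findall(r"[0oO]", "x0yozO")
print(zeros)     # ['0', 'o', 'O']
```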
Disjunction (|):
The pipe symbol simply means "either A or B", and it is helpful when any one of several alternative strings should match. For instance:
Pattern | Matches
A|B|C | A, B, C
Black|White | Black, White
[Bb]lack|[Ww]hite | Black, black, White, white
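The alternation patterns can be checked with a sketch like this one (Python's re module assumed, sample sentence invented):

```python
import re

text = "Black cat, white dog, black bird, White horse"

# Plain alternation is case-sensitive: only the capitalized words match.
exact = re.findall(r"Black|White", text)
print(exact)     # ['Black', 'White']

# Combining alternation with character classes covers both capitalizations.
any_case = re.findall(r"[Bb]lack|[Ww]hite", text)
print(any_case)  # ['Black', 'white', 'black', 'White']
```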
Question Mark (?):
The question mark symbol means that the character or group it follows is optional: it may appear once or not at all. For instance:
Pattern | Matches
Ab?c | Ac, Abc
Colou?r | Color, Colour
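To see the optional character in action, here is a minimal sketch. It assumes Python's re module and uses fullmatch so that the whole word has to fit the pattern.

```python
import re

colour = re.compile(r"Colou?r")
abc = re.compile(r"Ab?c")

# fullmatch succeeds only if the entire string fits the pattern.
colour_results = {w: bool(colour.fullmatch(w)) for w in ["Color", "Colour", "Colouur"]}
abc_results = {w: bool(abc.fullmatch(w)) for w in ["Ac", "Abc", "Abbc"]}

print(colour_results)  # {'Color': True, 'Colour': True, 'Colouur': False}
print(abc_results)     # {'Ac': True, 'Abc': True, 'Abbc': False}
```

The third word in each list fails because ? allows at most one occurrence.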
Asterisk (*):
The asterisk symbol matches 0 or more occurrences of the preceding character or group. For instance:
Pattern | Matches
Sh* | S, Sh, Shh, Shhh (0 or more of the preceding character h)
(banana)* | banana, bananabanana, bananabananabanana (0 or more of the preceding group banana; because zero repetitions are allowed, this pattern also matches the empty string)
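The "zero or more" behaviour, including the empty match, can be sketched as follows (Python's re module assumed; the test strings are made up):

```python
import re

sh_star = re.compile(r"Sh*")
banana_star = re.compile(r"(banana)*")

# "S" alone is accepted because h may occur zero times.
sh_results = [bool(sh_star.fullmatch(s)) for s in ["S", "Sh", "Shhh", "h"]]
print(sh_results)  # [True, True, True, False]

# Zero repetitions of the group means the empty string matches too.
empty_ok = banana_star.fullmatch("") is not None
double_ok = banana_star.fullmatch("bananabanana") is not None
extra_letter = banana_star.fullmatch("bananas") is not None
print(empty_ok, double_ok, extra_letter)  # True True False
```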
Plus (+):
The plus symbol matches 1 or more occurrences of the preceding character or group. For instance:
Pattern | Matches
Sh+ | Sh, Shh, Shhh (1 or more of the preceding character h)
(banana)+ | banana, bananabanana, bananabananabanana (1 or more of the preceding group banana)
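A matching sketch for "one or more", again assuming Python's re module and invented test strings:

```python
import re

sh_plus = re.compile(r"Sh+")
banana_plus = re.compile(r"(banana)+")

# At least one h is required, so "S" on its own is rejected.
sh_results = [bool(sh_plus.fullmatch(s)) for s in ["S", "Sh", "Shhh"]]
print(sh_results)      # [False, True, True]

# At least one "banana" is required, so the empty string is rejected.
banana_results = [bool(banana_plus.fullmatch(s)) for s in ["", "banana", "bananabanana"]]
print(banana_results)  # [False, True, True]
```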
Difference between Asterisk (*) and Plus (+):
The difference between the asterisk and the plus confuses many people; even experienced users sometimes have to look it up. However, there is an effortless way to remember the distinction between them.
Imagine you have the number 1, and you multiply it by 0:
1*0 = 0, so * means 0 or more occurrences of the preceding character or group.
Now suppose that you have the same number 1, and you add 0 to it:
1+0 = 1, so + means 1 or more occurrences of the preceding character or group.
It is that simple when you try to understand things intuitively.
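Running both quantifiers over the same text makes the distinction visible. This is only a sketch with a made-up input, using Python's re module:

```python
import re

text = "S Sh Shh"

# * accepts the bare "S" (zero h characters); + does not.
with_star = re.findall(r"Sh*", text)
with_plus = re.findall(r"Sh+", text)

print(with_star)  # ['S', 'Sh', 'Shh']
print(with_plus)  # ['Sh', 'Shh']
```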
Negation (^):
The caret symbol (^) has two everyday use cases:
1. As the first character inside square brackets, it negates the class, so the pattern matches any single character that is not listed in the brackets. For instance:
Pattern | Matches
[^Aa] | any single character that is not A or a
[^0123456789] | any single character that is not a digit
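Negated classes can be sketched like this (Python's re module assumed, inputs invented):

```python
import re

# Every character except A and a is returned, one character per match.
not_a = re.findall(r"[^Aa]", "Banana")
print(not_a)      # ['B', 'n', 'n']

# Every character that is not a digit is returned.
not_digit = re.findall(r"[^0123456789]", "a1b2")
print(not_digit)  # ['a', 'b']
```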
2. Outside square brackets, it can also be used as an anchor to search for expressions at the start of the line(s) only. In many engines it anchors to the start of the whole text by default, and a multiline option makes it match at the start of every line. For instance:
Pattern | Matches
^Apple | every Apple that is at the start of a line in the text
^(Apple|Banana) | every Apple or Banana that is at the start of a line in the text
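The start-of-line anchor can be sketched as below. This assumes Python's re module, where ^ anchors to the start of the whole string unless the re.MULTILINE flag is passed; the sample text is made up.

```python
import re

text = "Apple pie\nBanana split\nGreen Apple\nApple tart"

# Without a flag, ^ only matches at the very start of the text.
start_of_text = re.findall(r"^Apple", text)
print(start_of_text)  # ['Apple']

# With re.MULTILINE, ^ matches at the start of every line.
start_of_lines = re.findall(r"^Apple", text, re.MULTILINE)
print(start_of_lines)  # ['Apple', 'Apple']

# The group lets either word appear at the start of a line.
either = re.findall(r"^(Apple|Banana)", text, re.MULTILINE)
print(either)  # ['Apple', 'Banana', 'Apple']
```

The Apple in "Green Apple" is never matched because it is not at the start of its line.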
Dollar ($):
A dollar is used to search for expressions at the end of the line. It is written after the expression it anchors. For instance:
Pattern | Matches
[0123456789]$ | any digit at the end of a line in the text
([Pp]anda)$ | every Panda or panda at the end of a line in the text
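A short sketch of the end-of-line anchor, assuming Python's re module with the re.MULTILINE flag so that $ applies to every line. The sample text is invented. Note that the anchor goes after the expression; placing it first finds nothing here, because no digit can follow the end of a line.

```python
import re

text = "Room 7\nGiant Panda\nred panda\nPanda bears"

# A digit immediately before the end of a line.
digit_at_end = re.findall(r"[0-9]$", text, re.MULTILINE)
print(digit_at_end)  # ['7']

# Panda or panda immediately before the end of a line.
panda_at_end = re.findall(r"[Pp]anda$", text, re.MULTILINE)
print(panda_at_end)  # ['Panda', 'panda']

# With $ in front, the digit would have to come after the end of the line.
dollar_first = re.findall(r"$[0-9]", text, re.MULTILINE)
print(dollar_first)  # []
```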