Advanced Regular Expressions in sed, grep and vim
Category: Programming
Date: February 2023
Regular expressions are a powerful tool for searching and manipulating text. They let you build search patterns that match specific combinations of characters and words. Advanced features build on the basic syntax to provide more sophisticated search and replacement capabilities. In this article, we will explore some advanced regular expression techniques and their applications in grep, sed, and Vim.
Lookahead and Lookbehind Assertions:
These are special types of zero-width assertions that allow you to match patterns only if they are preceded or followed by specific patterns. For example, the regular expression (?<=\d{3})\d{4} will match four digits only if they are preceded by three digits. Lookahead and lookbehind assertions are useful for matching patterns in specific contexts.
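As a minimal sketch, the lookbehind pattern above can be tried with grep. This assumes a GNU grep built with PCRE support (the -P option); the sample lines and the variable name are made up for illustration.

```shell
# Three made-up input lines; only the first has three digits directly before four digits.
lookbehind_hits=$(printf '%s\n' '1234567' '12-4567' 'ab4567' | grep -oP '(?<=\d{3})\d{4}')

# -o prints only the matched text, so the three leading digits are not included.
printf '%s\n' "$lookbehind_hits"
```

The lookbehind checks the three preceding digits without consuming them, which is why only the four trailing digits are printed.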
Backreferences:
A backreference allows you to reference a previously matched group in your regular expression. For example, the regular expression \b(\w+)\s+\1\b will match any word followed by whitespace, followed by the same word. The \b word boundaries matter: without them, (\w+)\s+\1 would also match the "is is" hidden inside "this is". Backreferences are useful for matching repeated patterns.
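The following sketch compares a repeated-word pattern with and without \b word boundaries. It assumes GNU grep with -P support; the sample sentence and variable names are illustrative only.

```shell
# Made-up sentence: "the the" is a real repeat, "this is" only looks like one.
sample='this is a test of the the parser'

# Without word boundaries, the "is" at the end of "this" pairs with the next "is".
loose=$(printf '%s\n' "$sample" | grep -oP '(\w+)\s+\1')

# With \b on both sides, only whole repeated words are reported.
strict=$(printf '%s\n' "$sample" | grep -oP '\b(\w+)\s+\1\b')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern reports both "is is" and "the the", while the strict one reports only "the the".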
Conditional Statements:
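Conditional statements allow you to match patterns based on certain conditions. For example, the regular expression (?(?=regex)then|else) will try to match "then" if the lookahead "regex" succeeds at the current position, otherwise it will try to match "else". Conditionals are supported by Perl and PCRE (and therefore by grep -P), but not by every engine; JavaScript, for example, has no conditional syntax. Conditional statements are useful for creating complex search and replace patterns.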
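A small sketch of a conditional, assuming GNU grep with -P (PCRE) support. The rule it encodes is invented for illustration: a line that starts with a digit must be exactly four digits, and any other line must be lowercase letters only.

```shell
# Made-up input: two lines satisfy the rule, two do not.
matched=$(printf '%s\n' '2023' 'abc' '12ab' 'ABC' |
  grep -P '^(?(?=\d)\d{4}|[a-z]+)$')

# "12ab" starts with a digit, so only the \d{4} branch is tried, and it fails.
# "ABC" does not start with a digit, so only the [a-z]+ branch is tried, and it fails.
printf '%s\n' "$matched"
```

Once the condition has been decided, only the chosen branch is attempted, which is what separates a conditional from a plain alternation.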
Non-Capturing Groups:
A non-capturing group allows you to group together parts of your regular expression without capturing them. This is useful when you want to apply a quantifier to a group without capturing the group itself. For example, the regular expression (?:abc)+ will match "abc" one or more times, without capturing the group "abc".
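One practical effect of a non-capturing group is that it does not shift group numbers. The sketch below assumes GNU grep with -P support and uses made-up input: (?:foo|bar) groups the alternation, so the number that follows is still group 1.

```shell
# Made-up tokens: a prefix, a number, and the same number again (or not).
sample='foo-12-12 bar-7-7 foo-3-4'

# (?:foo|bar) is not numbered, so \1 refers to (\d+).
hits=$(printf '%s\n' "$sample" | grep -oP '(?:foo|bar)-(\d+)-\1\b')

printf '%s\n' "$hits"
```

"foo-3-4" is skipped because the second number differs from the captured one.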
Atomic Groups:
An atomic group, written (?>...) in Perl-style engines, is a type of non-capturing group that prevents backtracking into itself. This means that once the group has been matched, the engine will not go back and try a different way of matching its contents. Atomic groups are useful when you want to prevent catastrophic backtracking.
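The effect is easiest to see on a pattern that only matches if the engine is allowed to backtrack. This is a minimal sketch, assuming GNU grep with -P support and PCRE's (?>...) atomic group syntax; the input string is made up.

```shell
sample='aaab'

# Ordinary greedy a+ first takes "aaa", then gives one "a" back so that "ab" can match.
greedy=$(printf '%s\n' "$sample" | grep -cP '^a+ab$' || true)

# The atomic group also takes "aaa" but never gives anything back, so "ab" cannot match.
# grep exits non-zero when nothing matches, hence the "|| true".
atomic=$(printf '%s\n' "$sample" | grep -cP '^(?>a+)ab$' || true)

printf 'greedy matches: %s, atomic matches: %s\n' "$greedy" "$atomic"
```

The counts are 1 and 0: the atomic version fails quickly instead of exploring alternatives, which is the property that helps with pathological inputs.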
grep: Using lookarounds
The following example shows how to find all occurrences of the word "foo" that are not immediately followed by the word "bar":
grep -P 'foo(?!\s+bar)' file.txt
The -P option, available in GNU grep, enables Perl-compatible regular expressions (PCRE), including lookarounds.
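To see what this negative lookahead accepts and rejects, here is a sketch that feeds it a few made-up lines instead of file.txt (GNU grep with -P assumed).

```shell
# "foo bar" and "foo   bar" are rejected: whitespace followed by "bar" comes right after "foo".
# "foobar" is kept, because \s+ requires at least one whitespace character.
kept=$(printf '%s\n' 'foo bar' 'foo   bar' 'foo baz' 'foobar' |
  grep -P 'foo(?!\s+bar)')

printf '%s\n' "$kept"
```

If "foobar" should be rejected as well, one option is to use \s* in place of \s+ inside the lookahead.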
grep: Using backreferences
The following example shows how to find all lines that contain a repeated word:
grep -P '\b(\w+)\b.*\b\1\b' file.txt
The \b(\w+)\b pattern matches a word and captures it in group 1. The .* pattern matches any characters in between. Finally, the \b\1\b pattern matches the same word again, using a backreference to the first group.
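A quick check of this pattern on made-up lines (GNU grep with -P assumed). The third line contains "there" and "the", which share a prefix but are different words.

```shell
# Only the first line repeats a whole word ("the").
repeated=$(printf '%s\n' \
  'the cat saw the dog' \
  'no repeats here' \
  'there is the answer' |
  grep -P '\b(\w+)\b.*\b\1\b')

printf '%s\n' "$repeated"
```

The word boundaries around both the group and the backreference stop "the" from matching inside "there".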
sed: Using capture groups in replacements
The following example shows how to replace all occurrences of "foo bar" with "bar foo":
sed -E 's/(foo) (bar)/\2 \1/g' file.txt
The (foo) and (bar) patterns capture the two words in separate groups, which are then referenced in the replacement string as \2 \1.
sed: Using conditional expressions
The following example shows how to remove all lines that contain the word "foo" but not the word "bar":
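sed '/foo/ { /bar/!d; }' file.txt
The /foo/ pattern matches lines that contain the word "foo". The { /bar/!d; } block is only executed for lines that match the previous pattern, and it deletes the line (d) if it does not contain the word "bar". The -E option is not needed here because no extended syntax is used, and the semicolon before the closing brace keeps the command portable to sed implementations other than GNU sed.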
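The same address-and-block logic can be tried on a few made-up lines. This sketch writes the command in a compact form with a semicolon before the closing brace, which non-GNU sed implementations tend to require.

```shell
# Only "foo only" contains "foo" without "bar", so it is the single line deleted.
remaining=$(printf '%s\n' 'foo only' 'foo and bar' 'bar only' 'neither' |
  sed '/foo/{/bar/!d;}')

printf '%s\n' "$remaining"
```

Lines without "foo" never enter the block, so they pass through unchanged.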
Vim: Using lookaheads
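The following example shows how to find all occurrences of the word "foo" that are immediately followed by a comma:
/foo,\@=
The foo,\@= pattern matches "foo" only when a comma follows, using a positive lookahead (\@=). In Vim, \@= applies to the atom before it, so it must be placed after the comma; the comma itself is not part of the match. Written as foo\@=, the assertion would apply to the second "o" instead, and the pattern could never match. The same search can also be written as foo\ze, where \ze marks the end of the match.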
A positive lookahead is a type of assertion in regular expressions that allows you to match a pattern only if it is followed by another pattern. In Perl-style engines it is written as (?=pattern), where "pattern" is the pattern that should come after the current pattern being matched; Vim's equivalent is \(pattern\)\@=.
For example, if you have the string "I love cats and dogs", and you want to match only the word "cats" if it is followed by the word "and", you could use the positive lookahead assertion like this: cats(?= and)
In this case, the regular expression will match the word "cats" only if it is followed by the string " and", but the " and" will not be included in the match.
Positive lookaheads can be useful when you want to match a pattern only if it appears in a certain context, without actually including that context in the match. They are supported in many regular expression engines, including those used in programming languages like Perl, Python, and JavaScript.
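The difference between matching the context and merely asserting it shows up in what grep -o prints. A minimal sketch, assuming GNU grep with -P support; the variable names are illustrative.

```shell
sentence='I love cats and dogs'

# Plain pattern: " and" is consumed and becomes part of the match.
consumed=$(printf '%s\n' "$sentence" | grep -oP 'cats and')

# Lookahead: " and" must be present but is left out of the match.
asserted=$(printf '%s\n' "$sentence" | grep -oP 'cats(?= and)')

printf 'consumed: [%s]\nasserted: [%s]\n' "$consumed" "$asserted"
```

This matters most in replacements, where only the matched text is rewritten and the asserted context stays untouched.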
Vim: Using non-capturing groups
In Vim, you can create non-capturing groups with the \%(...\) syntax. Here's an example:
Suppose you have a text file with a list of names in the format "Last Name, First Name". You want to reverse the order of the names and format them as "First Name Last Name". You can use a non-capturing group to match the comma and the space after the comma without capturing them, and then use backreferences to swap the order of the names. Here's how you can do it in Vim:
:%s/\(\w\+\)\%(, \)\(\w\+\)/\2 \1/g
In this example, we're using capturing groups, written \(...\), to match the last name and the first name, and a non-capturing group, written \%(...\), to match the comma and the space after it. Because \%(...\) does not create a numbered group, the first name is still \2. For a fixed separator a literal ", " would work just as well; the group becomes useful once the separator needs an alternation or a quantifier, for example \%(, \|; \).
The replacement string \2 \1 swaps the order of the names, putting the first name first and the last name last.
Here's an example of what the text file might look like before and after applying the Vim command:
Before:
Smith, John
Doe, Jane
Jones, Tom
After:
John Smith
Jane Doe
Tom Jones
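For comparison, the same swap can be sketched outside Vim with sed -E. Extended regular expressions have no non-capturing groups, so the ", " separator is written literally, and [[:alnum:]_] stands in for Vim's \w. The input mirrors the sample list above.

```shell
names=$(printf '%s\n' 'Smith, John' 'Doe, Jane' 'Jones, Tom')

# Group 1 is the last name, group 2 the first name; the replacement swaps them.
swapped=$(printf '%s\n' "$names" |
  sed -E 's/([[:alnum:]_]+), ([[:alnum:]_]+)/\2 \1/')

printf '%s\n' "$swapped"
```

The output matches the "After" list, which gives a quick way to check the Vim command against a second tool.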
In summary, regular expressions are a powerful tool for searching and manipulating text. By mastering advanced regular expression techniques, you can greatly enhance your productivity when working with text files, whether you're using command-line tools like grep, sed, and awk, or text editors like Vim and Emacs.
Advanced regular expressions can also be complex and difficult to read. Use them judiciously and test them thoroughly before applying them to large datasets. With a little practice, however, you can use them to automate complex text manipulation tasks and save time and effort in your work.