Audvik Labs

What is a Python Regular Expression?

Introduction

A Python regular expression is a sequence of characters, some of them metacharacters, that defines a search pattern. String-searching algorithms use these patterns for “find” or “find and replace” operations on strings. In other words, the pattern is a string that describes what to match. The term “regular expression” is frequently shortened to “regex”.
Regular expressions are used in many applications that involve a lot of text processing. Some programming languages, such as Perl and Ruby, build regular expression support into the language syntax, whereas languages such as C, C++, and Python provide it through libraries; in Python’s case, that library is the re module of the standard library.

What are the uses of Python Regular Expressions?

Because a regex finds and/or replaces a given pattern, it is useful wherever pattern matching is central to the task. Some of its major applications are as follows:

  • Text Editors
  • Search Engines and Search Mechanisms
  • Back-ends of websites and APIs
  • Code Editors and IDEs
  • Data Entry Software
  • Form and User Input data validation
  • Data Analytics, Web Scraping
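
To make a few of these uses concrete, here is a minimal sketch using Python’s standard re module. The sample text, the date format, and the helper name looks_like_zip are assumptions chosen only for illustration; real validation rules are usually stricter.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66 shipped on 2024-01-15; order 67 ships on 2024-02-01."

# Find: collect every date written as YYYY-MM-DD.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['2024-01-15', '2024-02-01']

# Find and replace: mask every date in the text.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)
print(masked)  # Order 66 shipped on <date>; order 67 ships on <date>.


def looks_like_zip(value):
    """Return True for a 5-digit string (a deliberately simple, assumed rule)."""
    # fullmatch() succeeds only if the whole input fits the pattern.
    return re.fullmatch(r"\d{5}", value) is not None


print(looks_like_zip("12345"))  # True
print(looks_like_zip("1234a"))  # False
```

The same three operations (find, replace, validate) underlie most of the applications in the list.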

PYTHON REGEX – METACHARACTERS

Every character in a Python Regex is either a metacharacter or a regular character. A metacharacter has a special meaning whereas a regular character matches itself.
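A raw string does not treat backslashes in any special way. To create one, prepend an ‘r’ to the string literal that holds the pattern. Without the prefix, a pattern that matches a single literal backslash must be written as '\\\\'; with it, r'\\' is enough. (A raw string cannot end in a single backslash, so r'\' is not valid Python.)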
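
The raw-string point can be checked directly. The sketch below compares the two spellings of a pattern that matches one literal backslash; the sample path is a made-up value used only for illustration.

```python
import re

# The regex that matches one literal backslash is two characters long.
normal = "\\\\"  # ordinary string: each pair of backslashes becomes one
raw = r"\\"      # raw string: backslashes are kept exactly as typed

print(normal == raw)  # True
print(len(raw))       # 2

# Hypothetical sample text containing a single backslash.
path = r"C:\temp"
match = re.search(raw, path)
print(match.group() == "\\")  # True: the match is the single backslash
```

Both literals produce the same two-character pattern; the raw form is simply easier to read and harder to get wrong.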

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python “re” module provides regular expression support.
In Python a regular expression search is typically written as:
match = re.search(pat, str)
The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test whether the search succeeded, as shown in the following example, which searches for the pattern 'word:' followed by a 3-letter word:
import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)

# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())  ## prints: found word:cat
else:
    print('did not find')


The code match = re.search(pat, str) stores the search result in a variable named “match”. The if-statement then tests the match: if it is true, the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise, if the match is false (None, to be more specific), the search did not succeed and there is no matching text.
The ‘r’ at the start of the pattern string designates a Python “raw” string, which passes backslashes through without change; this is very handy for regular expressions.
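
The search-then-test idiom is often wrapped in a small function. The sketch below is one possible way to do that; the helper name find_tagged_word is hypothetical and not part of the re module. It also shows that a match object records where the match was found.

```python
import re


def find_tagged_word(text):
    """Return the 'word:xxx' fragment found in text, or None (hypothetical helper)."""
    match = re.search(r"word:\w\w\w", text)
    if match is None:
        # search() found nothing, so there is no matching text to return.
        return None
    return match.group()


print(find_tagged_word("an example word:cat!!"))  # word:cat
print(find_tagged_word("no tag here"))            # None

# A match object also knows where in the string the match sits.
m = re.search(r"word:\w\w\w", "an example word:cat!!")
print(m.start(), m.end())  # 11 19
```

Testing with "is None" rather than plain truthiness is a style choice here; both work, because a match object is always true.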

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, which match single chars:

  • Ordinary characters such as a, X, 9, and < just match themselves exactly. The metacharacters, which do not match themselves because they have special meanings, are: . ^ $ * + ? { } [ ] \ | ( )
  • . (a period) - matches any single character except newline '\n'
  • \w - (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
  • \b - boundary between word and non-word
  • \s - (lowercase s) matches a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
  • \t, \n, \r - tab, newline, return
  • \d - decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
  • ^ = start, $ = end - match the start or end of the string
  • \ - inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character such as '@' has special meaning, you can try putting a backslash in front of it, \@. If it is not a valid escape sequence, like \c, your Python program will halt with an error.
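
The patterns listed above can be tried out one by one. The following is a minimal sketch with made-up sample strings; each assert passes when the pattern behaves as described. The re.escape() call at the end is an addition not covered in the list: it is a standard-library function that inserts the backslashes for you.

```python
import re

# . matches any single character except a newline
assert re.search(r"c.t", "cut").group() == "cut"
assert re.search(r"c.t", "c\nt") is None

# \w matches one word character, \d one digit, \s one whitespace character
assert re.search(r"\w\w\w", "@@abc@@").group() == "abc"
assert re.search(r"\d\d", "room 42").group() == "42"
assert re.search(r"\d\s\d", "1 2").group() == "1 2"

# ^ and $ anchor the pattern to the start and end of the string
assert re.search(r"^cat", "catalog") is not None
assert re.search(r"^cat", "bobcat") is None
assert re.search(r"cat$", "bobcat") is not None

# \b marks a boundary between a word character and a non-word character
assert re.search(r"\bcat\b", "a cat sat") is not None
assert re.search(r"\bcat\b", "catalog") is None

# A backslash removes a metacharacter's special meaning
assert re.search(r"a.c", "abc") is not None   # unescaped . matches 'b'
assert re.search(r"a\.c", "abc") is None      # escaped . needs a real period
assert re.search(r"a\.c", "a.c").group() == "a.c"

# re.escape() adds the needed backslashes when you are unsure
pattern = re.escape("1+1=2?")
assert re.search(pattern, "is 1+1=2? yes").group() == "1+1=2?"

# An unknown letter escape such as \c is rejected with re.error
bad_escape_rejected = False
try:
    re.compile(r"\c")
except re.error as exc:
    bad_escape_rejected = True
    print("invalid pattern:", exc)
```

If every assert passes, the script prints only the message from the final, deliberately invalid pattern.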

Conclusion

Regular expressions are a way to validate data or to search and replace characters in strings. A regex consists of metacharacters, quantifiers, and literal characters that can be used to test whether a string passes a validation check or to search for and replace matches.
Regex can be a little overwhelming at first, but once you get it, it’s a little bit like riding a bike. It’ll be in the back of your memory and super easy to pick up again.
