Regular Expression

Regular expressions are what Perl is famous for. What is a regular expression? A regular expression is simply a string that describes a pattern. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., ls *.txt or dir *.*. In Perl, the patterns described by regular expressions are used to search strings, extract desired parts of strings, and do search-and-replace operations.

$_ = "Hello World";
if (/World/) {
    print "It matches\n";
}
else {
    print "It doesn't match\n";
}

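Here the pattern "World" is searched for in the string "Hello World", which is held in the default variable $_. This example prints "It matches" because the word "World" is present. Another way of doing the same thing is to bind the pattern to a named variable with the =~ operator: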

my $string = "Hello World";
if ($string =~ /World/) {
    print "It matches\n";
}
else {
    print "It doesn't match\n";
}

That is how things are matched, but how are strings replaced? This is how:

$string =~ s/re/replace/g;

This will replace all instances of the regular expression "re" in $string with the string "replace". For example:

my $string = "Hello World";
$string =~ s/Hello/Hell/;
print "$string";
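
To see what the g at the end changes, here is a small sketch. The sample sentence and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $first_only = "one fish, two fish";
my $every_one  = "one fish, two fish";

# Without g, only the first "fish" is replaced.
$first_only =~ s/fish/cat/;

# With g, every "fish" is replaced.
# s/// also returns the number of replacements it made.
my $count = ($every_one =~ s/fish/cat/g);

print "$first_only\n";   # one cat, two fish
print "$every_one\n";    # one cat, two cat
print "$count\n";        # 2
```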

There are some special characters in regular expressions (or REs). These are {}[]()^$.|*+?\
They are called metacharacters and have a special meaning in REs; to match one of them literally, put a backslash in front of it. Now let's see the meaning of each metacharacter.

Metacharacters	Description
*	0 or more matches of the atom
+	1 or more matches of the atom
?	0 or 1 matches of the atom
{m}	exactly m matches of the atom
{m,}	m or more matches of the atom
{m,n}	m through n (inclusive) matches of the atom

An atom is one of:

Atom	Description
(re)	(where re is any regular expression) matches re and stores the matched text in a variable ($1, $2, ...)
[chars]	a bracket expression, matching any one of the chars
.	matches any single character
\c	where c is alphanumeric (possibly followed by other characters), an escape.
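^	matches at the beginning of the string (with the m modifier, at the beginning of each line)
$	matches at the end of the string or before a final newline (with the m modifier, at the end of each line)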
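
As a rough illustration of how these atoms fit together, the sketch below filters a small word list. The words and the variable names are invented for this example.

```perl
use strict;
use warnings;

my @words = ("cat", "cut", "coat", "scat");
my @hits;

for my $word (@words) {
    # ^ and $ anchor the pattern to the whole string,
    # c and t are literal characters,
    # [au] matches exactly one character, either "a" or "u".
    push @hits, $word if $word =~ /^c[au]t$/;
}

print "@hits\n";   # cat cut
```

"coat" fails because [au] matches only one character, and "scat" fails because ^ requires the "c" to be the first character.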

Escapes

Escape	Description
\d	Matches any digit
\s	Matches any whitespace character (space, tab, newline)
\w	Matches any letter, any digit and underscore (_)
\D	Matches any non-digit character
\S	Matches any non-whitespace character
\n	Matches newline
\r	Matches carriage return

This is how it works...

RE	Description	Example
a?	match 'a' 1 or 0 times	a
a*	match 'a' 0 or more times, i.e., any number of times	aaaaaaa
a+	match 'a' 1 or more times, i.e., at least once	aaaaaaa
a{2,5}	match 'a' at least 2 times, but not more than 5 times	aaaa
a{2,}	match 'a' at least 2 times	aaa
a{5}	match 'a' exactly 5 times	aaaaa
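
The limits in the table can be checked with a small sketch like the one below. The test strings and the names %result and $unanchored are chosen just for this example.

```perl
use strict;
use warnings;

my %result;
for my $text ("a", "aa", "aaaaa", "aaaaaa") {
    # ^ and $ force the whole string to fit the pattern,
    # so the upper limit of 5 is really enforced.
    $result{$text} = ($text =~ /^a{2,5}$/) ? "match" : "no match";
    print "$text: $result{$text}\n";
}

# Without the anchors the pattern only has to match somewhere inside
# the string, so six a's still match (the first five are enough).
my $unanchored = ("aaaaaa" =~ /a{2,5}/) ? "match" : "no match";
print "unanchored: $unanchored\n";   # unanchored: match
```

Only "aa" and "aaaaa" pass the anchored test; "a" is too short and "aaaaaa" is too long.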

Let's combine these metacharacters to do some actual work.

What does ".*" do? It matches any single character (except a newline) 0 or more times. In short, it matches everything up to the end of the line.
For example
$file =~ s/\#.*$//gm;
will delete all comments in the variable '$file'. All characters from a '#' symbol to the end of the line are replaced with an empty string. The m modifier is needed so that $ matches at the end of every line, not just at the end of the whole string. This pattern is not perfect and can create a lot of errors: for example, it deletes everything after a '#' even when the '#' is inside a string literal.
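
Here is a minimal sketch of that substitution on a made-up three-line string. It assumes the whole file is held in one scalar, which is why the m modifier is used: it lets $ match at the end of every line.

```perl
use strict;
use warnings;

# Hypothetical file contents, held in a single string.
my $file = "x = 1  # set x\n# a full-line comment\ny = 2\n";

# m lets $ match at the end of each line;
# g repeats the substitution for every comment found.
$file =~ s/#.*$//gm;

print $file;
```

The newlines themselves are kept, so the full-line comment leaves an empty line behind, and the spaces in front of the first comment stay in place.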

Let's see how to break down an e-mail address.
Example : whatever@wherever.com
.*@.*\..*
There are several problems with this RE. The ".*@" part is not correct as it will also match an empty string before the @ sign (as in "@"). The same applies to the part after the @ sign. Make it
.+@.+\..*
After the . symbol (\..*) only a limited number of characters appear. In all the e-mail IDs I have seen, there are between 2 and 4. There are exceptions (.museum, for example) - but for this example, who cares?
.+@.+\..{2,4}
Now let us make it more specific.
\w+@\w+\..{2,4}
But this will miss some unusual e-mail IDs like this.is@my-email.id, because \w matches neither '.' nor '-'. So let us stick with the old one. Now to save the wanted parts.
(.+)@(.+)\.(.{2,4})

This will split the address into its important parts and store those parts. Let's see how it works.

my $email = "this.is\@my-email.id";
if ($email =~ /(.+)@(.+)\.(.{2,4})/) {
	print "E-mail : $email\n";
	print "ID : $1\n";
	print "Domain : $2\n";
	print "Tail : $3\n";
}

This script will print out the result

E-mail : this.is@my-email.id
ID : this.is
Domain : my-email
Tail : id

As you might have guessed, '(' and ')' store whatever matches the pattern inside them in $1, $2, $3, etc. The match for the first () is stored in $1, the second in $2, and so on.
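
The captured parts can also be assigned directly to named variables. The sketch below is one way to do it; the names $id, $domain and $tail are chosen for this example, and the ^ and $ anchors are an addition so that the whole address has to fit the pattern.

```perl
use strict;
use warnings;

my $email = 'whatever@wherever.com';

# In list context a successful match returns the captured groups in order,
# so they can be stored in named variables instead of $1, $2 and $3.
# If the match fails, the list is empty and all three stay undefined.
my ($id, $domain, $tail) = $email =~ /^(.+)\@(.+)\.(.{2,4})$/;

if (defined $id) {
    print "ID : $id\n";           # ID : whatever
    print "Domain : $domain\n";   # Domain : wherever
    print "Tail : $tail\n";       # Tail : com
}
```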

Some letters can be given after the last '/' for more options.

Characters	Use
g	Do a global search. Without this only the first instance is matched.
i	Ignores case.
s	Treats the string as a single line: . also matches a newline
m	Treats the string as multiple lines: ^ and $ match at the start and end of every line

Combinations of these letters can be used.

my $string = "Hello World";
$string =~ s/L//gi;
print "$string";

will give
Heo Word
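
The effect of i, m and s is easiest to see on a string that contains a newline. The following is a small sketch; the text and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# i: "FIRST" matches "first" regardless of case.
my $found_i = ($text =~ /FIRST/i) ? 1 : 0;

# Without m, ^ only matches at the very start of the string.
my $plain = ($text =~ /^second/) ? 1 : 0;

# With m, ^ also matches right after each newline.
my $with_m = ($text =~ /^second/m) ? 1 : 0;

# Without s, . stops at a newline; with s, . matches the newline too.
my $no_s   = ($text =~ /first.*second/)  ? 1 : 0;
my $with_s = ($text =~ /first.*second/s) ? 1 : 0;

print "$found_i $plain $with_m $no_s $with_s\n";   # 1 0 1 0 1
```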

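Another small example - one of my few original ideas. A very cheap encryption method - I like to call it Re-Encryption. It rearranges the characters inside every block of seven characters, and a second substitution puts them back in order. It only scrambles the text, so do not use it for anything that really has to stay secret.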

#!/usr/local/bin/perl

my $string = "Everyone is entitled to my opinion.";
print "Original String : $string\n";
$string =~ s/(.)(.)(.)(.)(.)(.)(.)/$3$6$7$1$2$4$5/g; #Encrypting
print "Encrypted String : $string\n";
$string =~ s/(.)(.)(.)(.)(.)(.)(.)/$4$5$1$6$7$2$3/g; #Decrypting
print "Decrypted String : $string\n";
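
The same two substitutions can be wrapped in subroutines to check the round trip. This is only a sketch: the names scramble and unscramble are hypothetical, and the sample string was picked to show what happens when the length is not a multiple of seven.

```perl
use strict;
use warnings;

# Rearranges each block of seven characters (same pattern as above).
sub scramble {
    my ($text) = @_;
    $text =~ s/(.)(.)(.)(.)(.)(.)(.)/$3$6$7$1$2$4$5/g;
    return $text;
}

# Puts each block of seven characters back in its original order.
sub unscramble {
    my ($text) = @_;
    $text =~ s/(.)(.)(.)(.)(.)(.)(.)/$4$5$1$6$7$2$3/g;
    return $text;
}

my $original  = "Hello World";   # 11 characters: one block of 7, then 4 left over
my $scrambled = scramble($original);
my $restored  = unscramble($scrambled);

print "$scrambled\n";   # l WHeloorld
print "$restored\n";    # Hello World
```

The last four characters ("orld") do not fill a block of seven, so neither substitution touches them and they appear unchanged in the scrambled text.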

To see Regular Expressions in action or to debug your Regular Expressions, get the program Regex Coach from http://weitz.de/regex-coach/. I recommend it highly.
