Beginner's Tutorial for CGI using Perl Language
Regular Expression

Regular expressions are what Perl is famous for. What is a regular expression? It is simply a string that describes a pattern. Patterns are in common use these days: examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., ls *.txt or dir *.*. In Perl, the patterns described by regular expressions are used to search strings, extract desired parts of strings, and do search-and-replace operations.

$_ = "Hello World";
if (/World/) {
    print "It matches\n";
}
else {
    print "It doesn't match\n";
}

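Here the pattern "World" is searched for in the string "Hello World", which is held in the default variable $_. The example prints "It matches" because the word "World" is present. Another way of doing the same thing is to bind the pattern to a named variable with the =~ operator: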

my $string = "Hello World";
if ($string =~ /World/) {
    print "It matches\n";
}
else {
    print "It doesn't match\n";
}

That is how things are matched, but how are strings replaced? Like this:

$string =~ s/re/replace/g;

This replaces all instances of the regular expression "re" in $string with the string "replace". For example:

my $string = "Hello World";
$string =~ s/Hello/Hell/;
print "$string";

Some characters have a special meaning in regular expressions (REs). These are {}[]()^$.|*+?\
They are called metacharacters. Now let's see what they mean.

Metacharacter   Description
*               0 or more matches of the atom
+               1 or more matches of the atom
?               0 or 1 matches of the atom
{m}             exactly m matches of the atom
{m,}            m or more matches of the atom
{m,n}           m through n (inclusive) matches of the atom

An atom is one of:

Atom      Description
(re)      (where re is any regular expression) matches whatever re matches, with the matched text stored in a variable ($1, $2, ...)
[chars]   a bracket expression, matching any one of the chars
.         matches any single character (except a newline, unless the s option is used)
\c        where c is alphanumeric (possibly followed by other characters), an escape
^         matches at the beginning of the string (or of every line with the m option)
$         matches at the end of the string (or of every line with the m option)
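
The atoms above can be tried out with a short script. This is only a sketch: the sample words and the helper name show() are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether $text matches the pattern $re.
sub show {
    my ($text, $re) = @_;
    return $text =~ $re ? "match" : "no match";
}

print show("cat",    qr/c.t/),    "\n";  # . stands for any single character
print show("cot",    qr/c[au]t/), "\n";  # [au] allows only 'a' or 'u' here
print show("concat", qr/^cat/),   "\n";  # ^ ties the match to the beginning
print show("concat", qr/cat$/),   "\n";  # $ ties the match to the end

# (re) stores what it matched; the first group lands in $1.
my $vowel = "";
if ("gray" =~ /gr([ae])y/) {
    $vowel = $1;
    print "vowel: $vowel\n";
}
```

Run as is, it should report match, no match, no match, match, and then the captured vowel 'a'.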

Escapes

Escape   Description
\d       matches any digit
\s       matches any whitespace character (space, tab, newline, ...)
\w       matches any letter, digit, or underscore (_)
\D       matches any non-digit character
\S       matches any non-whitespace character
\n       matches a newline
\r       matches a carriage return
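
To get a feel for the escapes, here is a small sketch. The sample line is invented, and it relies on the fact that a match with the g option (described further down) returns every matched substring when it is assigned to a list.

```perl
use strict;
use warnings;

# Invented sample text; note the tab between "shipped" and "on".
my $line = "Order 66 shipped\ton 2024-01-15";

my @numbers = $line =~ /\d+/g;   # runs of digits
my @words   = $line =~ /\w+/g;   # runs of letters, digits and underscores
my @chunks  = $line =~ /\S+/g;   # runs of non-whitespace characters

print "numbers: @numbers\n";
print "words  : @words\n";
print "chunks : @chunks\n";
```

The tab splits the chunks just like a space does, because both count as whitespace, while the '-' characters in the date split the \w+ and \d+ runs.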

This is how it works...

RE       Description                                             Example
a?       match 'a' 0 or 1 times                                  a
a*       match 'a' 0 or more times, i.e., any number of times    aaaaaaa
a+       match 'a' 1 or more times, i.e., at least once          aaaaaaa
a{2,5}   match 'a' at least 2 times, but not more than 5 times   aaaa
a{2,}    match 'a' 2 or more times                               aaa
a{5}     match 'a' exactly 5 times                               aaaaa
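
The table can be checked with a few lines of Perl. This is a sketch under one assumption of mine: every pattern is wrapped in ^ and $ so that the quantifier has to account for the whole string. The helper name matching() is made up.

```perl
use strict;
use warnings;

# Each entry pairs a label with the anchored pattern it stands for.
my @patterns = (
    [ 'a?',     qr/^a?$/     ],
    [ 'a*',     qr/^a*$/     ],
    [ 'a+',     qr/^a+$/     ],
    [ 'a{2,5}', qr/^a{2,5}$/ ],
    [ 'a{2,}',  qr/^a{2,}$/  ],
    [ 'a{5}',   qr/^a{5}$/   ],
);

# Hypothetical helper: list the labels of all patterns that match $input.
sub matching {
    my ($input) = @_;
    return map { $_->[0] } grep { $input =~ $_->[1] } @patterns;
}

for my $input ("", "a", "aaa", "aaaaa", "aaaaaa") {
    printf "%-10s %s\n", "'$input'", join(", ", matching($input));
}
```

Without the anchors, a pattern such as a{5} would also succeed on "aaaaaa", since it only needs to find five a's somewhere in the string.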

Let's combine these metacharacters to do some actual work.

What does ".*" do? It matches any single character (other than a newline) 0 or more times. In short, it matches everything up to the end of the line.
For example
$file =~ s/\#.*$//gm;
will delete all comments in the variable '$file'. All characters from a '#' symbol to the end of the line are replaced with an empty string. The m option is needed when '$file' holds more than one line, so that '$' matches at the end of every line and not only at the end of the whole string. This is not perfect and can create a lot of errors: for example, it also deletes everything from a '#' onwards when the '#' is inside a string literal.
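
To try the comment stripper on text with several lines, something like the following sketch can be used. The sample text is invented, and the m option is used so that '$' matches at the end of every line.

```perl
use strict;
use warnings;

# Invented sample input: one full-line comment, one inline comment,
# and one '#' inside a string literal.
my $file = <<'END';
# configuration
name = demo   # inline comment
color = "#fff"
END

$file =~ s/#.*$//gm;   # delete from each '#' to the end of its line
print $file;
```

The first two lines lose their comments as intended, but the third line is cut off right after the opening quote, which is exactly the kind of error mentioned above.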

Let's see how to break down an e-mail address.
Example : whatever@wherever.com
.*@.*\..*
There are several problems with this RE. The ".*@" part is not correct, as it will also match an empty string before the @ sign (a bare "@"). The same goes for the part after the @ sign. Make it
.+@.+\..*
After the last . symbol (the \..* part) only a limited number of characters appear. In all the e-mail addresses I have seen, it is between 2 and 4 (newer top-level domains can be longer). There may be exceptions - but who cares?
.+@.+\..{2,4}
Now let us make it more specific.
\w+@\w+\..{2,4}
But this will miss some unusual e-mail addresses like this.is@my-email.id, because \w does not match the '-' in the domain part. So let us stick with the old one. Now to save the wanted parts.
(.+)@(.+)\.(.{2,4})

This will split the address into its important parts and store those parts. Let's see how it works.

my $email = "this.is\@my-email.id";
if ($email =~ /(.+)@(.+)\.(.{2,4})/) {
    print "E-mail : $email\n";
    print "ID : $1\n";
    print "Domain : $2\n";
    print "Tail : $3\n";
}
This script will print out the result:
E-mail : this.is@my-email.id
ID : this.is
Domain : my-email
Tail : id

As you might have understood, '(' and ')' store whatever the pattern inside them matched into $1, $2, $3 etc. The first () is stored in $1, the second in $2, and so on.
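
Two more ways of using the stored parts, as a small sketch with invented sample data: a match assigned to a list hands back the captured groups directly, and $1, $2 etc. can also be used in the replacement part of s///.

```perl
use strict;
use warnings;

my $date = "2024-01-15";

# Assigned to a list, a successful match returns the captured groups.
my ($year, $month, $day) = $date =~ /(\d{4})-(\d{2})-(\d{2})/;
print "Year : $year, Month : $month, Day : $day\n";

# $1 and $2 are available in the replacement part, here to swap two words.
my $greeting = "Hello World";
$greeting =~ s/(\w+) (\w+)/$2 $1/;
print "$greeting\n";
```

The second half is the same trick the encryption example further down relies on.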

Some letters can be given after the last '/' for more options.

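Character   Use
g           Do a global search. Without this only the first instance is matched.
i           Ignore case.
s           Treat the string as a single line, so that . also matches a newline.
m           Treat the string as multiple lines, so that ^ and $ match at the start and end of every line.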

Combinations of these letters can be used.

my $string = "Hello World";
$string =~ s/L//gi;
print "$string";
will give
Heo Word
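
The s and m options are easier to understand with a string that contains a newline. The following is a sketch with an invented two-line string; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without m, ^ matches only at the very start of the string.
my @starts_plain = $text =~ /^(\w+)/g;
# With m, ^ also matches right after every newline.
my @starts_m     = $text =~ /^(\w+)/gm;

# Without s, . refuses to cross the newline between the two lines.
my $dot_plain = $text =~ /first.*second/  ? "yes" : "no";
# With s, . matches the newline as well.
my $dot_s     = $text =~ /first.*second/s ? "yes" : "no";

# i makes the match ignore the difference between upper and lower case.
my $case_i    = $text =~ /SECOND/i        ? "yes" : "no";

print "plain ^ : @starts_plain\n";
print "with m  : @starts_m\n";
print "plain . : $dot_plain\n";
print "with s  : $dot_s\n";
print "with i  : $case_i\n";
```

Only the m version finds "second" as a line start, and only the s version lets ".*" reach from the first line into the second.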

Another small example - one of my few original ideas. A very cheap encryption method - I like to call it Re-Encryption.

#!/usr/local/bin/perl

my $string = "Everyone is entitled to my opinion.";
print "Original String : $string\n";
$string =~ s/(.)(.)(.)(.)(.)(.)(.)/$3$6$7$1$2$4$5/g; #Encrypting
print "Encrypted String : $string\n";
$string =~ s/(.)(.)(.)(.)(.)(.)(.)/$4$5$1$6$7$2$3/g; #Decrypting
print "Decrypted String : $string\n";

To see Regular Expressions in action or to debug your Regular Expressions, get the program Regex Coach from http://weitz.de/regex-coach/. I recommend it highly.
