Regular expressions and search patterns

Thursday, 20 September 2007, adz

Every Unix system offers several useful commands for finding files and searching them for strings. Combined with streams, pipes, redirections, and regular expressions, they form powerful tools that are well suited to administrative tasks.

1. Regular expressions

Regular expressions are patterns that describe sets of text strings. They have been part of Unix systems from their beginnings: Ken Thompson, one of Unix’s creators, built them into early text editors such as ed, drawing on the notation introduced by the mathematician Stephen Kleene. Regular expressions are very effective tools for processing text, and they are well worth the effort it takes to master them; even a passing familiarity will pay dividends.

Symbol Matches
. any single character
^ the expression placed after the operator, at the beginning of a line
$ the expression placed before the operator, at the end of a line
\x the special character x taken literally, e.g. \$ stands for a dollar sign
[list] any one character from a list or a range, e.g. [0-9] or [a-d]
() groups regular expressions
? the preceding element zero or one time
a|b a or b
* the preceding element zero or more times
+ the preceding element one or more times
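
To make the table more concrete, here is a small sketch that tries three of the operators with grep -E (the extended syntax, in which ?, +, | and () need no backslashes). The word list and the variable names are invented for this illustration only.

```shell
# Sample input, one word per line (made up for this example).
words=$(printf '%s\n' color colour colr coloour cat cot)

# ? : the preceding "u" may appear zero or one time.
optional_u=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# + : the preceding "o" must appear one or more times.
one_or_more_o=$(printf '%s\n' "$words" | grep -E '^colo+ur$')

# | and () : either "a" or "o" between "c" and "t".
a_or_o=$(printf '%s\n' "$words" | grep -E '^c(a|o)t$')

printf '%s\n' "$optional_u" "$one_or_more_o" "$a_or_o"
```

The ^ and $ anchors are there so that each pattern has to cover the whole word rather than just a part of it.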

2. Regular expressions and glob patterns

It’s worth noting that the Bash shell did not support regular expressions before version 3.0. Until then they were used only by the small text-processing programs such as sed, awk, and grep, while Bash itself supported only glob patterns (wildcards).

Symbol Replaces
* any string of characters
? exactly one character
[list] any character from a list or a range e.g. [0-9] or [a-d]
[^list] any character not placed on a list or outside a range e.g. [^0-9] or [^a-d]
{list} each of the comma-separated strings in the list, e.g. {txt,log}

Regular expressions search for a string in a text stream, whereas glob patterns are expanded by the shell into the list of matching file names before the command is run. In the following simple example, all files whose names start with the character “a” are deleted.

adam@laptop:~/Documents/$ ls
aa  abc        new.txt       example.txt
ab  error.txt  command.txt  all_about_console.txt

adam@laptop:~/Documents/$ rm a*

adam@laptop:~/Documents/$ ls
error.txt  new.txt  command.txt  example.txt

And now we will delete all the files whose names do not start with a character from the range b through z; the rest of the name may be any string of characters.

adam@laptop:~/Documents/$ ls
aa  abc        new.txt       example.txt
ab  error.txt  command.txt  all_about_console.txt

adam@laptop:~/Documents/$ rm [^b-z]*

adam@laptop:~/Documents/$ ls
error.txt  new.txt  command.txt  example.txt
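
Because rm cannot be undone, it helps to preview what a pattern expands to before using it. The sketch below recreates the example files in a throwaway directory (the sandbox variable and the mktemp directory are assumptions of this illustration, not part of the session above) and uses echo to show each expansion.

```shell
# Work in a temporary directory so that nothing real is deleted.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch aa ab abc all_about_console.txt error.txt new.txt command.txt example.txt

# echo prints the file names that the shell substitutes for each pattern.
starts_with_a=$(echo a*)        # names beginning with "a"
two_chars=$(echo ??)            # names exactly two characters long
e_or_n_txt=$(echo [en]*.txt)    # names beginning with "e" or "n" and ending in .txt
echo "$starts_with_a"
echo "$two_chars"
echo "$e_or_n_txt"

# Only now run the destructive command and list what is left.
rm a*
remaining=$(echo *)
echo "$remaining"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

Replacing rm with echo first is a cheap habit that shows exactly which files a command would receive.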

3. grep

grep is a commonly used program that searches its input stream for lines matching a given pattern, called a regular expression. grep was originally written by Ken Thompson and can be found on every Unix system. Its behavior is controlled by several options, including:

  • -c – displays only the number of matching lines,
  • -n – displays the number of the line within a file where the text string was found,
  • -w – searches for whole words only,
  • -x – searches for whole lines only.

adam@laptop:~/Documents/$ dmesg | grep Mouse
[   15.436000] input: USB-PS/2 Optical Mouse as /class/input/input2
[   15.436000] input: USB HID v1.10 Mouse [USB-PS/2 Optical Mouse] on usb-0000:00:1d.0-2

The above example shows “grep” in action. I passed the output of dmesg (which displays the log of kernel events) to the grep command as input with the help of the pipe operator (|). The regular expression consists of one word, Mouse (matching is case-sensitive). grep filtered its input and displayed only those lines which contained the string “Mouse”.
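
The four options listed above can be compared on a tiny input. This is only a sketch: the sample lines and variable names are invented here, and the text is fed to grep through a pipe instead of coming from dmesg.

```shell
# Four sample lines (made up for this example).
sample=$(printf '%s\n' 'mouse' 'Mouse' 'Mouse pad' 'Mousetrap')

# -c: print only the number of matching lines.
count=$(printf '%s\n' "$sample" | grep -c 'Mouse')

# -n: prefix every matching line with its line number.
numbered=$(printf '%s\n' "$sample" | grep -n 'Mouse')

# -w: "Mouse" must stand as a whole word, so "Mousetrap" is skipped.
whole_words=$(printf '%s\n' "$sample" | grep -w 'Mouse')

# -x: the whole line must be exactly "Mouse".
whole_lines=$(printf '%s\n' "$sample" | grep -x 'Mouse')

printf '%s\n' "$count" "$numbered" "$whole_words" "$whole_lines"
```

The lowercase “mouse” never matches, because grep is case-sensitive unless told otherwise.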

Let’s assume that for the purpose of our examples we created a text file expressions.txt with the following contents:

1. fruit
2. cycles
3. house
4. car
5. disk

So we will get:

grep '^[1-6]..c' expressions.txt
2. cycles
4. car

This expression finds all lines which start with a digit from the range 1 to 6, followed by any two characters (here a dot and a space), then one “c” character, and finally any string of characters.

The following expression shows every line ending with the “k” character.

grep 'k$' expressions.txt
5. disk

Our next example uses another regular expression.

grep 'c.*c' expressions.txt
2. cycles

This time we search for any string beginning with c, followed by any string of characters, and ending with c. Special attention should be paid to the “.*” sequence. As we remember from the table of regular expression operators, the star (*) matches zero or more repetitions of the element placed before it. The second operator used in this example, the dot (.), denotes any character. To put it simply, “.*” matches any character repeated zero or more times.
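
To see which part of a line “c.*c” really covers, grep can be asked to print only the matched text. The sketch below relies on the -o option, which is an extension available in GNU and BSD grep rather than in every grep; the second input line is invented for this illustration.

```shell
# -o prints only the part of each line that matched the pattern
# (assumption: GNU or BSD grep, where -o is available).
from_file=$(printf '%s\n' '2. cycles' '4. car' | grep -o 'c.*c')

# ".*" takes as many characters as it can, so the match runs
# from the first "c" on the line to the last one.
longest=$(printf '%s\n' 'a cat can catch' | grep -o 'c.*c')

printf '%s\n' "$from_file" "$longest"
```

The line “4. car” contributes nothing, because it contains only one “c”.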

grep "\(1\|4\)" expressions.txt 
1. fruit
4. car

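The example is interesting from another point of view. In the basic syntax which grep uses by default, the characters (, ) and | are ordinary characters, so each of them is preceded by a backslash \ to give it its special meaning of grouping or alternation (the \| form is an extension found in GNU grep). The whole expression is enclosed in quotation marks so that Bash passes the backslashes to grep untouched instead of interpreting them itself. The utility of regular expressions is extensive, so I encourage you to practice them on your own.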
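
The same search can be written in more than one way. The following sketch recreates expressions.txt in a temporary directory (the directory and the variable names are assumptions of this illustration) and compares the backslashed basic form with grep -E and with a simple character list.

```shell
# Recreate the sample file in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
printf '%s\n' '1. fruit' '2. cycles' '3. house' '4. car' '5. disk' > expressions.txt

# Basic syntax: grouping and alternation are written with backslashes
# (\| is a GNU extension in this syntax).
basic=$(grep '\(1\|4\)' expressions.txt)

# Extended syntax (-E): the same operators without backslashes.
extended=$(grep -E '(1|4)' expressions.txt)

# For single characters, a list in square brackets does the same job.
char_list=$(grep '[14]' expressions.txt)

printf '%s\n' "$basic" "$extended" "$char_list"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

All three commands should select the same two lines, so the choice is mostly a matter of readability.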

4. find

This command was created in order to look for files or directories. It is a very advanced program which is able to search for files by their size, ownership, and access or modification times. The simplest invocation looks like this:

adam@laptop:~$ find . -name linux

The general form is find starting_directory searching_criteria actions: first we give the directory to start from, then the search criteria, and finally the actions to perform on the found files (a directory is also a file in Unix systems). In this example the starting_directory is the current directory, denoted by a dot (as I said earlier, every directory has a link to itself expressed by a dot), and the searching_criteria is the name “linux”; no action is given. The searching process starts from starting_directory and proceeds down the directory tree: first the current directory is searched, then all its subdirectories, provided we have the permissions to read them. The -name criterion accepts (I’d say it should be used with) the glob patterns and wildcards described earlier in this article; quote such a pattern so that the shell passes it to find unexpanded.
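
Here is a small sketch of an exact name versus a quoted glob pattern. The directory layout and the variable names are invented for this illustration, and everything happens in a temporary directory.

```shell
# Build a small directory tree in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir -p docs/linux
touch linux.txt docs/linux-notes.txt docs/readme.txt

# An exact name: only entries called precisely "linux" are reported.
exact=$(find . -name linux)

# A quoted pattern: find itself matches "linux*" at every level of the tree.
# Unquoted, the shell would first expand linux* to ./linux.txt.
pattern=$(find . -name 'linux*' | sort)

printf '%s\n' "$exact" "$pattern"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

The output is passed through sort only because find reports entries in the order it happens to meet them.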

find also gives us the choice to search the file system by file type.

adam@laptop:~$ find . -type d

It is fairly easy to guess that this expression will find all directories and subdirectories within the current directory. Of course, criteria can be combined.

adam@laptop:~$ find . -type d -name Documents

This example finds only the directories named exactly “Documents”. The most commonly used file types are presented in the following table.

Parameter File type
b block device
c character device
d directory
f regular file
l symbolic link
p named pipe (FIFO)
s socket

The find command invoked with the -size value parameter searches for files of exactly the declared size. When we prefix the value with a plus sign (e.g. -size +value), the program finds files larger than the value. You can reverse the search by using a minus sign (e.g. -size -value), which finds files smaller than the value. By default the value is counted in blocks of 512 bytes, but you can select other units:

  • c – bytes,
  • k – kilobytes,
  • M – megabytes,
  • G – gigabytes.

The following example shows how to search for files whose sizes lie between 100 MB and 200 MB.

adam@laptop:~$ find . -size +100M -size -200M
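
The same kind of range can be tried out on a small scale without creating huge files. In this sketch the file names, their sizes, and the temporary directory are all assumptions made for the illustration; the k suffix is used instead of M.

```shell
# Create three files of known size in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
dd if=/dev/zero of=small.bin  bs=512  count=1 2>/dev/null   # 512 bytes
dd if=/dev/zero of=medium.bin bs=1024 count=2 2>/dev/null   # 2 kilobytes
dd if=/dev/zero of=large.bin  bs=1024 count=8 2>/dev/null   # 8 kilobytes

# Larger than 1 kilobyte AND smaller than 4 kilobytes.
between=$(find . -type f -size +1k -size -4k)

# Larger than 4 kilobytes.
larger=$(find . -type f -size +4k)

printf '%s\n' "$between" "$larger"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

find rounds sizes up to whole units before comparing, which is why the 512-byte file counts as 1 kilobyte and is not “larger than 1k”.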

Other search criteria are listed in the table:

Criterion Meaning
-atime n The file was last accessed n days ago.
-mtime n The file was last modified n days ago.
-newer file The file was modified more recently than “file”.
-links n The file has exactly n hard links.
-perm p The file has permissions p, where p stands for access rights expressed numerically.
-user user The file is owned by the user named user.
-group group The file belongs to the group named group.
-empty The file or directory is empty.

Numerical arguments can be prefixed with the + and - characters, which mean “greater than” and “less than” respectively (in the same way as described for the -size criterion).
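
The time-based criteria are easiest to check with files whose timestamps are set by hand. The sketch below is only an illustration: the file names and dates are invented, and touch -t is used to give two of the files modification times in the past.

```shell
# Create files with known modification times in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch -t 200001010000 old.txt         # modified on 1 January 2000
touch -t 200101010000 reference.txt   # modified on 1 January 2001
touch new.txt                         # modified just now

# Files modified more recently than reference.txt.
newer=$(find . -type f -newer reference.txt)

# Files modified more than 7 days ago (note the + prefix).
older_than_week=$(find . -type f -mtime +7 | sort)

printf '%s\n' "$newer" "$older_than_week"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

A reference file combined with -newer is a handy way to ask for “everything changed since a given moment”.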

As I mentioned earlier, the find command can perform certain operations on found files. The default action is -print, which writes file names together with their paths. Some older versions of find need to have this option given explicitly every time.

Another action available in “find” is declared by the -ls option. It writes information about files in the same way as the well-known command ls run with the -lids parameters. The last important operation is “execution”, expressed by the -exec option. A command to run, that is a program together with its arguments, is given as a parameter to the option.

adam@laptop:~$ find . -size +100M -size -150M -ls
2485508 115744 -rw-r--r--   1 adam     adam     118398976 maj  2 22:44 

find . -size +110M -size -150M -exec cp {} /home/adam/files/ \;

As we can see, the first of the two examples writes detailed information about the found files. But the second one is more interesting from our point of view. All files larger than 110 MB and smaller than 150 MB will be copied to the files directory (/home/adam/files). The braces {} are replaced with the path of each found file, so the command is issued once for every file. The semicolon ; marks the end of the command, and the backslash \ before it “protects” it from being interpreted by the system shell. The -exec option exists in another flavor, -ok. It works the same as “-exec” but asks the user for permission before every command it is about to run.
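
The copying example can be rehearsed safely on a few empty files. In the following sketch the src and backup directories and the file names are hypothetical, and the files are selected by name instead of by size.

```shell
# Prepare a source and a destination directory in a throwaway location.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir src backup
touch src/a.log src/b.log src/notes.txt

# {} is replaced with the path of each found file;
# \; ends the command that -exec is to run.
find src -type f -name '*.log' -exec cp {} backup/ \;

copied=$(ls backup)
printf '%s\n' "$copied"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

Swapping -exec for -ok in the same command would make find ask for confirmation before each copy.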

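The -prune option stops find from descending into the directories matched by the criteria that precede it, which excludes those subdirectories from the search.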
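
A common way of using -prune is sketched below. The project layout and the directory name cache are invented for this illustration; the -o operator used here is the OR operator of find.

```shell
# Build a small tree with one directory we want to skip.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir -p project/cache project/src
touch project/cache/tmp.txt project/src/main.txt project/readme.txt

# When a directory named "cache" is reached, -prune stops the descent.
# Everything else falls through -o to the "-type f -print" part.
found=$(find project -name cache -prune -o -type f -print | sort)

printf '%s\n' "$found"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

The explicit -print matters here: without it, find would also print the pruned directory itself.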

The find syntax gives users the ability to create complex expressions by merging criteria with the help of logical operators. By default, consecutive criteria are merged with AND.

conjunction operator (AND) (-a) All criteria must be satisfied.
alternative operator (OR) (-o) At least one of the criteria must be satisfied.
negation operator (NOT) (\!) The criterion that follows must not be satisfied.

It is possible to group expressions logically using parentheses, which have to be protected from the shell with backslashes: \( \).
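
The effect of grouping is easiest to see by running the same criteria with and without parentheses. This is a sketch with invented names; note the directory old.log, which is there on purpose.

```shell
# A few files plus a directory whose name ends in .log.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
mkdir notes old.log
touch a.txt b.log c.md notes/d.txt

# Grouped: regular files AND (name ends in .txt OR name ends in .log).
grouped=$(find . -type f \( -name '*.txt' -o -name '*.log' \) | sort)

# Ungrouped: AND binds tighter than OR, so this reads as
# (regular file AND *.txt) OR (*.log), and the directory old.log slips in.
ungrouped=$(find . -type f -name '*.txt' -o -name '*.log' | sort)

# Negation: regular files whose names do NOT end in .txt.
not_txt=$(find . -type f \! -name '*.txt' | sort)

printf '%s\n' "$grouped" "$ungrouped" "$not_txt"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

Whenever -o appears in an expression, it is worth adding the parentheses explicitly.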

One of the more important parameters of the find command is -print0. When it is applied, the names of found files are separated by the null character instead of a “new line” character. Let’s consider the following example with two files: june report.txt and junereport.txt.

adam@laptop:~$ find . -name "*report*" | xargs rm
rm: cannot remove `./june': No such file or directory
rm: cannot remove `report.txt': No such file or directory

The rm command was not able to remove the june report.txt file: its name contains a space, so xargs split it into two separate file names.

find . -name "*report*" -print0 | xargs -0 rm

To circumvent the difficulty we applied a modified command. The find command got the -print0 option, and the xargs program obtained the matching -0 option. Attention! The -print0 option is not part of every find; we can make use of it when we have the GNU version of the find command.
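
The whole scenario can be replayed in a temporary directory. In this sketch the extra file keep.txt and the variable names are assumptions added for the illustration; it needs a find and an xargs that support -print0 and -0.

```shell
# Recreate the two report files plus one file that should survive.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch 'june report.txt' junereport.txt keep.txt

# Names are separated by null characters, so the space in
# "june report.txt" no longer splits the name in two.
find . -name '*report*' -print0 | xargs -0 rm

remaining=$(ls)
printf '%s\n' "$remaining"

# Remove the temporary directory again.
cd / && rm -rf "$sandbox"
```

Both report files are removed this time, and rm prints no error.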

This article is part of the Command line tricks series.
Go back to the previous article: UNIX Pipes, Streams and Redirections Explained »
Go to the next article: System and environmental variables »

Translated by P2O2, Proof-read by trashcat and chaddy