Grep parameters every bioinformatician should know

Your shell, along with the myriad command-line programs it exposes, is clearly a great friend when it comes to file manipulation. And let’s face it, file manipulation is a big part of a bioinformatician’s daily workload. Since we rarely have the time to review all the options offered by the different programs, I thought I’d list some really useful ones from grep.

I expect everyone to know what grep is and what it does, so let’s just get to the params 🙂

-i, --ignore-case

This is an easy one: it’s useful when you don’t feel compelled to type your search pattern in the exact case used inside the file (or standard input) you are searching.
Use case:

# looking for gene entries in both Human and Mouse annotation files
# using Gene Symbols (e.g. HOXB4 vs Hoxb4).

> grep -i 'hoxb4' human_or_mouse_annotation_file.txt
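
To see the difference for yourself, here is a minimal sketch. The three-line annotation file is made up for illustration (file name and columns are assumptions, not a real annotation format):

```shell
# Made-up annotation file: symbol, chromosome, species (tab-separated).
workdir=$(mktemp -d)
printf 'HOXB4\tchr17\thuman\nHoxb4\tchr11\tmouse\nGAPDH\tchr12\thuman\n' \
    > "$workdir/annotation.txt"

# Case-sensitive search: only the mouse-style spelling is counted.
exact=$(grep -c 'Hoxb4' "$workdir/annotation.txt")

# With -i, both the human-style and mouse-style spellings are counted.
any_case=$(grep -c -i 'hoxb4' "$workdir/annotation.txt")

echo "case-sensitive: $exact, case-insensitive: $any_case"
```

Here -c simply counts matching lines instead of printing them, which makes the comparison easy to read.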

-w, --word-regexp

Select only those lines containing matches that form whole words.
This one can be a godsend sometimes. For example, using grep to search for ‘chr1’ without -w will leave you feeling sad as you realize you also extracted lines for chr10, chr11, chr12, chr13, etc. But now you know better !
Use case:

# extracting Chromosome 1 features from a gtf. 

> grep -w 'chr1' genes.gtf
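
A quick way to convince yourself, sketched on a made-up four-line file (the contig names and coordinates are placeholders). Since grep treats letters, digits and underscores as word characters, -w also rejects names like chr1_KI270706v1_random. Keep in mind that -w still matches ‘chr1’ anywhere on the line, so the awk variant below, which only tests the first column, is a stricter alternative:

```shell
# Made-up GTF-like file: only the first line is really on chr1.
workdir=$(mktemp -d)
printf 'chr1\tsrc\tgene\t100\t200\n'  > "$workdir/genes.gtf"
printf 'chr10\tsrc\tgene\t100\t200\n' >> "$workdir/genes.gtf"
printf 'chr11\tsrc\tgene\t100\t200\n' >> "$workdir/genes.gtf"
printf 'chr1_KI270706v1_random\tsrc\tgene\t100\t200\n' >> "$workdir/genes.gtf"

# Plain substring match: every line contains "chr1".
loose=$(grep -c 'chr1' "$workdir/genes.gtf")

# Whole-word match: "chr10" and "chr1_..." no longer qualify.
strict=$(grep -c -w 'chr1' "$workdir/genes.gtf")

# Stricter alternative: compare the first column exactly.
by_column=$(awk -F'\t' '$1 == "chr1" {n++} END {print n+0}' "$workdir/genes.gtf")

echo "no -w: $loose, with -w: $strict, awk on column 1: $by_column"
```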

-f FILE, --file=FILE

Want to look for various patterns in one go ?
Dump them in a file of their own, 1 pattern per line, and extract the whole shebang using a single call !
Use case:

# Fetch the MAFs for a set of hundreds of SNPs for which you only have the rsIDs.

> grep -w -f file_of_patterns.txt ESP.vcf

# (Just pipe the output into a small awk script, and bam ! Mission success.)
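
Here is a rough sketch of that whole pipeline on a toy file. The VCF lines below are invented (real ESP records carry a much richer INFO column), and the awk step just prints the ID and INFO columns. I also added -F (uppercase, fixed strings), which tells grep to treat the rsIDs as literal text rather than regular expressions; it is not the same thing as -f (lowercase), which names the pattern file:

```shell
# Toy VCF with a simplified, made-up INFO column.
workdir=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$workdir/toy.vcf"
printf '1\t100\trs123\tA\tG\t.\tPASS\tMAF=0.10\n'  >> "$workdir/toy.vcf"
printf '1\t200\trs1234\tC\tT\t.\tPASS\tMAF=0.20\n' >> "$workdir/toy.vcf"
printf '2\t300\trs999\tG\tA\t.\tPASS\tMAF=0.30\n'  >> "$workdir/toy.vcf"

# One pattern per line.
printf 'rs123\nrs999\n' > "$workdir/ids.txt"

# -f reads the patterns from the file, -F takes them literally,
# -w keeps rs123 from also matching rs1234.
result=$(grep -w -F -f "$workdir/ids.txt" "$workdir/toy.vcf" \
    | awk -F'\t' '{print $3, $8}')

echo "$result"
```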

-v, --invert-match

Pretty much self-explanatory: it returns lines that do NOT contain your pattern.
Use case:

# generate a "female" version of a gtf. (because I can...)

> grep -v 'chrY' unisex.gtf

#  and you're done !
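
One detail worth knowing: when you give -v several patterns with -e, grep only keeps the lines that match none of them. A small sketch on a made-up file (the header line and features are placeholders) that drops both the comment line and the chrY entries in one call:

```shell
# Made-up GTF-like file with one header line and three features.
workdir=$(mktemp -d)
printf '#!made-up header\n'            > "$workdir/unisex.gtf"
printf 'chr1\tsrc\tgene\t100\t200\n'  >> "$workdir/unisex.gtf"
printf 'chrX\tsrc\tgene\t100\t200\n'  >> "$workdir/unisex.gtf"
printf 'chrY\tsrc\tgene\t100\t200\n'  >> "$workdir/unisex.gtf"

# Everything except the chrY line (the header is kept).
without_y=$(grep -v -c 'chrY' "$workdir/unisex.gtf")

# Drop lines matching either pattern: header lines and chrY features.
features_only=$(grep -v -c -e '^#' -e 'chrY' "$workdir/unisex.gtf")

echo "without chrY: $without_y, without chrY and headers: $features_only"
```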

And last but not least:

-B NUM, --before-context=NUM
-A NUM, --after-context=NUM

These ask grep to also print the NUM lines before (-B) or after (-A) each line that matches the pattern.
Use case:

# Extract the reads from a fastq which contain a specific nucleotide sequence 

> grep -B 1 -A 2 -i 'ccggaacgcagcgaagtggaacgcgcgaactgcgaa' reads.fastq

# (Granted, the sequence's reverse complement is not accounted for...)
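
If you want to cover the reverse complement too, one possible sketch is to build it with rev and tr and hand both sequences to grep with -e. The toy reads and the GATTACA motif below are made up. Also note that grep prints a "--" line between non-adjacent groups of context; the --no-group-separator option used here to suppress it is, as far as I know, specific to GNU grep, so treat it as an assumption about your system:

```shell
# Toy FASTQ: read1 carries the motif, read3 its reverse complement.
workdir=$(mktemp -d)
printf '@read1\nTTGATTACATT\n+\nIIIIIIIIIII\n'  > "$workdir/reads.fastq"
printf '@read2\nCCCCCCCCCCC\n+\nIIIIIIIIIII\n' >> "$workdir/reads.fastq"
printf '@read3\nAATGTAATCAA\n+\nIIIIIIIIIII\n' >> "$workdir/reads.fastq"

motif='GATTACA'

# Reverse the string, then complement each base.
rc=$(printf '%s\n' "$motif" | rev | tr 'ACGTacgt' 'TGCAtgca')

# 1 line before (header) + the sequence + 2 lines after ('+' and qualities)
# gives back complete 4-line records for either orientation.
hits=$(grep --no-group-separator -B 1 -A 2 -i \
    -e "$motif" -e "$rc" "$workdir/reads.fastq")

echo "$hits"
```

This assumes each record spans exactly four lines, with the sequence on a single line, as in the original example.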

Of course, feel free to mix and match to get the desired effect…
And don’t forget that you can pipe grep’s output… into grep !
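
As a small sketch of such a chain (the GTF lines, coordinates and the FAKE1 gene are invented for illustration), each grep narrows down what the previous one let through:

```shell
# Made-up GTF-like file.
workdir=$(mktemp -d)
printf 'chr17\tsrc\tgene\t100\t900\t.\t-\t.\tgene_name "HOXB4";\n'    > "$workdir/genes.gtf"
printf 'chr17\tsrc\texon\t100\t300\t.\t-\t.\tgene_name "HOXB4";\n'   >> "$workdir/genes.gtf"
printf 'chr17\tsrc\tgene\t1000\t1900\t.\t-\t.\tgene_name "HOXB5";\n' >> "$workdir/genes.gtf"
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_name "FAKE1";\n'    >> "$workdir/genes.gtf"

# Keep chr17 lines, then those mentioning hoxb4 (any case),
# then throw away the exon entries.
result=$(grep -w 'chr17' "$workdir/genes.gtf" \
    | grep -i 'hoxb4' \
    | grep -v -w 'exon')

echo "$result"
```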

So there you have it: my favorite grep params.
If I omitted your favorite grep param, please leave a comment !

2017-04-29T15:35:48+00:00 November 27, 2015 | Categories: Data Analysis, Shell scripting

About the Author:

Originally trained in molecular biology, I quickly realized my heart lay with bioinformatics ! (How can anyone be presented with an HMM and not fall in love ?). While I spend most of my days writing Python code, I must admit I am starting to enjoy my occasional dip in R.
