Your shell, along with the myriad command-line programs it exposes, is clearly a great friend when it comes to file manipulation. And let’s face it, file manipulation is a big part of a bioinformatician’s daily workload. Now, since we rarely have the time to review all the options offered by the different programs, I thought I’d list some really useful ones from grep.
I expect everyone to know what grep is and what it does, so let’s just get to the params 🙂
-i, --ignore-case
Well, this is an easy one: it’s useful when you don’t feel like typing your search pattern in the exact case used inside the file (or standard input) of your choosing.
Use case:
# looking for gene entries in both Human and Mouse annotation files
# using Gene Symbols (e.g. HOXB4 vs Hoxb4).
> grep -i 'hoxb4' human_or_mouse_annotation_file.txt
-w, --word-regexp
Select only those lines containing matches that form whole words.
This one can be a godsend sometimes. For example, using grep to search for ‘chr1’ without specifying -w will leave you feeling sad as you realize you also extracted lines for chr10, chr11, chr12, chr13, etc. But now you know better!
Use case:
# extracting Chromosome 1 features from a gtf.
> grep -w 'chr1' genes.gtf
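To see the difference for yourself, here is a minimal sketch. The three-line, tab-separated file is made up for illustration and only mimics the first columns of a real gtf.

```shell
# Hypothetical miniature GTF-like file (tab-separated), made up for illustration.
tmpdir=$(mktemp -d)
printf 'chr1\tsrc\tgene\t100\t200\n'  >  "$tmpdir/genes.gtf"
printf 'chr10\tsrc\tgene\t300\t400\n' >> "$tmpdir/genes.gtf"
printf 'chr11\tsrc\tgene\t500\t600\n' >> "$tmpdir/genes.gtf"

# Without -w, 'chr1' also matches chr10 and chr11.
loose=$(grep -c 'chr1' "$tmpdir/genes.gtf")
# With -w, only the whole word chr1 matches.
strict=$(grep -c -w 'chr1' "$tmpdir/genes.gtf")

echo "without -w: $loose line(s); with -w: $strict line(s)"
rm -r "$tmpdir"
```

Keep in mind that grep counts letters, digits and the underscore as word characters, so a name such as chr1-alt would still be picked up by -w 'chr1'.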
-f FILE, --file=FILE
Want to look for various patterns in one go?
Dump them in a file of their own, 1 pattern per line, and extract the whole shebang using a single call!
Use case:
# Fetch the MAFs for a set of hundreds of SNPs for which you only have the rsIDs.
> grep -w -f file_of_patterns.txt ESP.vcf
# (Just pipe the output into a small awk script, and bam! Mission success.)
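Here is a rough sketch of the whole round trip, pattern file to awk. Everything in it is an assumption made for illustration: the rsIDs are invented, the file is a heavily simplified VCF-like table, and a real ESP.vcf carries a much richer INFO column. Adding -F on top of -f tells grep to treat each pattern as a fixed string rather than a regular expression, which is what you want for plain IDs.

```shell
tmpdir=$(mktemp -d)
# One rsID per line (hypothetical IDs).
printf 'rs123\nrs456\n' > "$tmpdir/file_of_patterns.txt"
# Simplified VCF-like records: CHROM POS ID REF ALT QUAL FILTER INFO
{
  printf '1\t1000\trs123\tA\tG\t.\tPASS\tMAF=0.12\n'
  printf '1\t2000\trs1234\tC\tT\t.\tPASS\tMAF=0.34\n'
  printf '2\t3000\trs456\tG\tA\t.\tPASS\tMAF=0.05\n'
  printf '2\t4000\trs789\tT\tC\t.\tPASS\tMAF=0.40\n'
} > "$tmpdir/ESP.vcf"

# -f reads the patterns from the file, -w keeps rs123 from matching rs1234,
# -F treats each pattern as a fixed string. awk then prints the ID and INFO columns.
mafs=$(grep -w -F -f "$tmpdir/file_of_patterns.txt" "$tmpdir/ESP.vcf" \
  | awk -F'\t' '{ print $3, $8 }')
echo "$mafs"
rm -r "$tmpdir"
```

Only the rs123 and rs456 records make it through; rs1234 is rejected by -w even though it starts with one of the patterns.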
-v, --invert-match
Pretty much self-explanatory: it returns the lines that do NOT contain your pattern.
Use case:
# generate a "female" version of a gtf. (because I can...)
> grep -v 'chrY' unisex.gtf
# and you're done!
And last but not least:
-B NUM, --before-context=NUM
-A NUM, --after-context=NUM
These ask grep to print the NUM lines before (-B) or after (-A) each pattern hit.
Use case:
# Extract the reads from a fastq which contain a specific nucleotide sequence
> grep -B 1 -A 2 -i 'ccggaacgcagcgaagtggaacgcgcgaactgcgaa' reads.fastq
# (Granted, the sequence's reverse complement is not accounted for...)
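If the reverse complement matters to you, one possible workaround is to build it on the fly and hand both sequences to grep with -e. This is only a sketch under a few assumptions: the three reads below are made up, every fastq record is exactly 4 lines, and the `rev` utility is available (it ships with util-linux and macOS, but it is not part of POSIX).

```shell
tmpdir=$(mktemp -d)
# Three made-up FASTQ records (4 lines each).
cat > "$tmpdir/reads.fastq" <<'EOF'
@read1
TTACCGGAACGTT
+
IIIIIIIIIIIII
@read2
GGGGGGGGGGGGG
+
IIIIIIIIIIIII
@read3
AACGTTCCGGTAA
+
IIIIIIIIIIIII
EOF

seq='ccggaacg'
# Reverse complement: reverse the string, then swap complementary bases.
rc=$(printf '%s\n' "$seq" | rev | tr 'acgtACGT' 'tgcaTGCA')

# -e can be repeated to search for several patterns at once.
# grep prints a "--" line between non-adjacent context groups; drop it.
hits=$(grep -B 1 -A 2 -i -e "$seq" -e "$rc" "$tmpdir/reads.fastq" | grep -v '^--$')

# Counting '@' headers is good enough for this toy data
# (real quality lines may also start with '@').
n_reads=$(printf '%s\n' "$hits" | grep -c '^@')
echo "$hits"
rm -r "$tmpdir"
```

Here read1 carries the sequence itself and read3 carries its reverse complement, so both records come out and read2 is left behind. Dropping the "--" lines with a second grep is a shortcut: it would also discard a quality line made of exactly two dashes, which can only happen with 2-base reads.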
Of course, feel free to mix and match to get the desired effect…
And don’t forget that you can pipe grep’s output… into grep!
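As a small illustration of chaining, the sketch below keeps only the exon lines of chromosome 1. The four-line, tab-separated file is invented for the example and only loosely resembles a gtf.

```shell
tmpdir=$(mktemp -d)
# Made-up, tab-separated GTF-like lines for illustration only.
{
  printf 'chr1\tsrc\tgene\t100\t200\n'
  printf 'chr1\tsrc\texon\t100\t150\n'
  printf 'chr10\tsrc\texon\t300\t350\n'
  printf 'chrY\tsrc\texon\t10\t50\n'
} > "$tmpdir/genes.gtf"

# First grep keeps chromosome 1 only; second grep keeps the exon lines among those.
chr1_exons=$(grep -w 'chr1' "$tmpdir/genes.gtf" | grep -w 'exon')
echo "$chr1_exons"
rm -r "$tmpdir"
```

Each grep in the chain only sees what the previous one let through, so the filters behave like a logical AND.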
So there you have it: my favorite grep params.
If I omitted your favorite grep param, please leave a comment!