Looking over the contents of our growing blog (good job guys !), it occured to me that we had not yet posted an article pertaining to the fantastic (and homegrown !) bioinformatics resource that is pyGeno.
It turns out I need to use pyGeno to generate data and it’s also my turn to write a blog post, how convenient !

I’ll focus the article on writing a SNP filter, which can be a bit surprising the first time you try it.

First off, if you haven’t done so already, please review the installation and importing procedures because those steps will not be covered here. Let’s just say that I’m working on Human Genome GRCh38.83 along with dbSNP149 and a set of sample specific SNPs called using freebayes which I’ll refer to as sampleSNPs.

Now for a reason I won’t go into, I need to filter out SNPs present in the common subset of dbSNP from sampleSNPs and use those remaining to generate the personalized proteome our proteomics friends need to run the peptide identification steps.
I could use tools like the excellent SnpSift/SnpEff to pre-process my SNP sets, but where’s the fun in that ? 😉
And besides, PyGeno’s SNP Filters are perfect for this task !

Here’s the example that the pyGeno documentation leaves you with:


class MyFilter(SNPFilter) :
    def __init__(self, thres) :
        self.thres = thres
        
    def filter(chromosome, SNP_Set1 = None, SNP_Set2 = None ) :
        if SNP_Set1.alt is not None \
                and (SNP_Set1.alt == SNP_Set2.alt) \
                and SNP_Set1.Qmax_gt > self.thres:
            return SequenceSNP(SNP_Set1.alt)
        
        return None

Pretty generic.. But a good starting place nonetheless.
So let’s tweak it a bit to filter out dbSNP149 entries and only keep sampleSNPs entries with a quality threshold above a defined value.


class PseudoSomaticFilter(SNPFilter) :
    def __init__(self, thres) :
        self.thres = thres
        
    def filter(chromosome, sampleSNPs = None, dbSNP149 = None ) :
        if dbSNP149 is None and sampleSNPs.quality > self.thres:
            return SequenceSNP(sampleSNPs.alt)
        
        return None

# init Genome
genome =  Genome(species='human', name='GRCh38.83', SNPs=['sampleSNPs', 'dbSNP149'],     
                 SNPFilter=PseudoSomaticFilter(20))
# and then do some work...

Like I said, a few things might seem a bit off putting here, let’s run through them:

First of all, it is important to understand when the filter method gets called.
Here’s a breakdown of what happens:

  • For every chromosomic position covered by the currently loaded object (protein, transcript, etc..)
  • pyGeno looks for known SNPs at that position in any of the loaded SNP sets
  • If a SNP is found in either SNP set: submit a value for every loaded SNP set to filter

Secondly, it seems a bit strange to use the names of our SNP sets as the parameters of the filter function, doesn’t it ?.
This is due to the fact that, internally, pyGeno makes calls to filter by passing it a dictionary with the SNP set name as key to the SNP data itself (ie: {'sampleSNPs': pyGeno_SNP_entry, 'dbSNP149': pyGeno_SNP_entry}), expanding said dictionnary automatically using python’s ** operator.
This is the aspect that gave me the most trouble at first and ultimately entails that filters are not entirely reusable components. But now that you are aware of this, it shouldn’t throw you off anymore !

Thirdly, as you can see I’m accessing the quality attribute of sampleSNPs to compare against the threshold and there’s nothing bizarre in that but one must be aware of the fact that not all SNP types present the same attributes. In this case, dbSNP149 is a dbSNPSNP and sampleSNPs is an AgnosticSNP. PyGeno supports various SNP types for import and it’s important to know which type you are dealing with when writing your SNPFilter to ensure you have access to the attributes you are expecting.

And finally, to respect the function signature, it is important for our SNPFilter to return one of the following entities:

  • SequenceSNP
  • SequenceInsert
  • SequenceDel
  • None

So there you go, a brand new filter to use on my data for me and debunked beginner issues for you 🙂
Enjoy pyGeno !