Skip to content
/ km Public

km : a software for RNA-seq investigation using k-mer decomposition

License

Notifications You must be signed in to change notification settings

iric-soft/km

Folders and files

NameName
Last commit message
Last commit date

Latest commit

3ae2273 · Jun 3, 2024
Nov 1, 2018
Nov 19, 2021
May 25, 2020
Jun 3, 2024
Oct 7, 2020
Nov 19, 2021
Oct 5, 2020
Oct 24, 2017
Jun 3, 2024
Jun 3, 2024
Jun 3, 2024

Repository files navigation

km : a software for RNA-seq investigation using k-mer decomposition

pyversion pypi codecov

Introduction:

This tool was developed to identify and quantify the occurence of single nucleotide variants, insertions, deletions and duplications in RNA-seq data. Contrary to most tools that try to report all variants in a complete genome, here we instead propose to focus the analysis on small regions of interest.

Given a reference sequence (typically a few hundred base pairs) around a known or suspected mutation in a gene of interest, all possible sequences that can be be created between the two end k-mers according to the sequenced reads will be reported. A ratio of variant allele vs WT will be computed for each possible sequence constructed.

Citing:

Install:

Using pip (recommended)

python3 -m venv $HOME/.virtualenvs/km
source $HOME/.virtualenvs/km/bin/activate
pip install --upgrade pip setuptools wheel
pip install km-walk

Requirements:

  • Python 3.8.0 or later with pip installed.

Usage:

  • The virtual environment needs to be loaded each time you open a new terminal, with this command:
$ source $HOME/.virtualenvs/km/bin/activate

Test:

  • 4bp insertion in NPM1
$ cd [your_km_folder]
$ km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf | km find_report -t ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa
Sample        Region  Location        Type    Removed Added   Abnormal        Normal  Ratio   Min_coverage    Exclu_min_cov   Variant Target  InfoVariant_sequence    Reference_sequence
./data/jf/02H025_NPM1.jf      chr5:171410540-171410543        chr5:171410544  ITD     0       4 | 4   2870.6  3055.2  0.484   2428            /TCTG   NPM1_4ins_exons_10-11utr        vs_ref  AATTGCTTCCGGATGACTGACCAAGAGGCTATTCAAGATCTCTGTCTGGCAGTGGAGGAAGTCTCTTTAAGAAAATAGTTTAAA    AATTGCTTCCGGATGACTGACCAAGAGGCTATTCAAGATCTCTGGCAGTGGAGGAAGTCTCTTTAAGAAAATAGTTTAAA
./data/jf/02H025_NPM1.jf              -       Reference       0       0       0.0     2379.0  1.000   2379            -       NPM1_4ins_exons_10-11utr        vs_ref
# To display kmer coverage
$ km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf -g
  • ITD of 75 bp
$ cd [your_km_folder]
$ km find_mutation ./data/catalog/GRCh38/FLT3-ITD_exons_13-15.fa ./data/jf/03H116_ITD.jf | km find_report -t ./data/catalog/GRCh38/FLT3-ITD_exons_13-15.fa
Sample        Region  Location        Type    Removed Added   Abnormal        Normal  Ratio   Min_coverage    Exclu_min_cov   Variant Target  Info    Variant_sequence        Reference_sequence
./data/jf/03H116_ITD.jf               -       Reference       0       0       0.0     443.0   1.000   912             -       FLT3-ITD_exons_13-15    vs_ref
./data/jf/03H116_ITD.jf       chr13:28034105-28034179 chr13:28034180  ITD     0       75 | 75 417.6   1096.7  0.276   443             /AACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACC    FLT3-ITD_exons_13-15    vs_ref  CTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGC    CTTTCAGCATTTTGACGGCAACCTGGATTGAGACTCCTGTTTTGCTAATTCCATAAGCTGTTGCGTTCATCACTTTTCCAAAAGCACCTGATCCTAGTACCTTCCCAAACTCTAAATTTTCTCTTGGAAACTCCCATTTGAGATCATATTCATATTCTCTGAAATCAACGTAGAAGTACTCATTATCTGAGGAGCCGGTCACCTGTACCATCTGTAGCTGGCTTTCATACCTAAATTGCTTTTTGTACTTGTGACAAATTAGCAGGGTTAAAACGACAATGAAGAGGAGACAAACACCAATTGTTGCATAGAATGAGATGTTGTCTTGGATGAAAGGGAAGGGGC

Bootstrap:

Or you can run easy_install.sh which installs km in a virtual environement and test it as shown above. Running the script as is will install km in a virtual environment in: $HOME/.virtualenvs/km.

./easy_install.sh

From source code

km can be executed directly from source code.

Requirements:

  • Python 3.6.0 or later
  • Jellyfish 2.2 or later with Python bindings (or pyJellyfish module).

Usage:

$ cd [your_km_folder]
$ python -m km find_mutation ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa ./data/jf/02H025_NPM1.jf | km find_report -t ./data/catalog/GRCh38/NPM1_4ins_exons_10-11utr.fa

Design your target sequence:

  • km is designed to make targeted analysis based on target sequences. These target sequences need to be designed and given to km as input.
  • A target sequence is a nucleotide sequence saved in a fasta file. Some target sequences are provided in catalog.
  • To fit your specific needs, you will have to create your own target sequences.
  • On generic cases, you can follow some good practices described below:

https://raw.githubusercontent.com/iric-soft/km/master/data/figure/doc_target_sequence.png

  • A web portal is available to assist you in the creation of your target sequences (for cases 1 and 2).
  • You could also extract nucleotide sequences from genome using severals methods, two of them are discribe below:
    • Using samtools: samtools faidx chr2:25234341-25234405 GRCh38/genome.fa
    • Using get DNA from ucsc.

Display help:

$ km -h
  usage: PROG [-h] {find_mutation,find_report,linear_kmin,min_cov} ...

  positional arguments:
    {find_mutation,find_report,linear_kmin,min_cov}
                          sub-command help
      find_mutation       Identify and quantify mutations from a target sequence
                          and a k-mer database.
      find_report         Parse find_mutation output to reformat it in tabulated
                          file more user friendly.
      linear_kmin         Find min k length to decompose a target sequence in a
                          linear graph.
      min_cov             Compute coverage of target sequences.

  optional arguments:
    -h, --help            show this help message and exit

km's tools overview:

For more detailed documentation click here.

find_mutation:

This is the main tool of km, to identify and quantify mutations from a target sequence and a k-mer jellyfish database.

$ km find_mutation -h
$ km find_mutation [your_fasta_targetSeq] [your_jellyfish_count_table]
$ km find_mutation [your_catalog_directory] [your_jellyfish_count_table]

find_report:

This tool parse find_mutation output to reformat it in more user friendly tabulated file.

$ km find_report -h
$ km find_report -t [your_fasta_targetSeq] [find_mutation_output]
$ km find_mutation [your_fasta_targetSeq] [your_jellyfish_count_table] | km find_report -t [your_fasta_targetSeq]

min_cov:

This tools display some k-mer's coverage stats of a target sequence and a list of jellyfish database.

$ km min_cov -h
$ km min_cov [your_fasta_targetSeq] [[your_jellyfish_count_table]...]

linear_kmin:

Length of k-mers is a central parameter:

  • To produce a linear directed graph from the target sequence.
  • To avoid false-positive. find_mutation shouldn't be use on jellyfish count table build with k<21 bp (we recommand k=31 bp, by default)

linear_kmin tool is design to give you the minimun k length to allow a decomposition of a target sequence in a linear graph.

$ km linear_kmin -h
$ km linear_kmin [your_catalog_directory]

Runing km on a real sample from downloaded fastq:

In the example folder you can find a script to help you to run a km analysis on one Leucegene sample.