trxtools package

trxtools.SAMgeneral module

trxtools.SAMgeneral.countDeletion(i=(), expand=0)

Takes a tuple (position, CIGAR string) and returns the positions of mapped deletions. Each deletion is counted once, but the expand parameter can merge some longer deletions (common expansion).

Parameters
  • i (tuple) – tuple (first position of the read, CIGAR string)

  • expand (int, optional) – number of nucleotides to expand for each side, defaults to 0

Returns

list of mapped positions

Return type

np.array

>>> countDeletion((400,"3S15M1D9M2S"))
array([415])
>>> countDeletion((400,"3S15M1D9M2S"),expand=3)
array([412, 413, 414, 415, 416, 417, 418])
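
The behaviour shown in the two examples above can be approximated in plain Python. This is only a sketch, not the trxtools implementation: the helper name deletion_positions is hypothetical, it returns a list rather than np.array, and it assumes that a deletion is recorded once, at its first deleted base, before being expanded on both sides.

```python
import re

def deletion_positions(read, expand=0):
    """Illustrative sketch: deletion positions for one (position, CIGAR) tuple."""
    position, cigar = read
    ref = position  # reference coordinate of the next aligned base
    hits = set()    # a set merges overlapping expanded deletions
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "D":
            # Assumption: record the deletion once, at its first deleted base
            hits.update(range(ref - expand, ref + expand + 1))
        if op in "MDN=X":  # operations that consume the reference
            ref += int(length)
    return sorted(hits)

print(deletion_positions((400, "3S15M1D9M2S")))            # [415]
print(deletion_positions((400, "3S15M1D9M2S"), expand=3))  # [412, 413, 414, 415, 416, 417, 418]
```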
trxtools.SAMgeneral.countMiddle(i=(), expand=0)

Takes tuple (position,CIGARstring) and returns list with the middle of mapped read

Parameters
  • i (tuple) – tuple (first position of the read, CIGAR string)

  • expand (int, optional) – number of nucleotides to expand for each side, defaults to 0

Returns

list of mapped positions

Return type

np.array

>>> countMiddle((400,"3S15M1D9M2S"))
array([412])
>>> countMiddle((400,"3S15M1D9M2S"),expand=3)
array([409, 410, 411, 412, 413, 414])
trxtools.SAMgeneral.countRead(i=())

Takes tuple (position,CIGARstring) and returns list of mapped positions

Parameters

i (tuple) – tuple (first position of the read, CIGAR string)

Returns

list of mapped positions

Return type

np.array

>>> countRead((400,"3S15M1D9M2S"))
array([400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
   413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424])
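
A minimal sketch of how the mapped positions in the example above follow from the CIGAR string. It assumes that only reference-consuming operations (M, D, N, =, X) contribute to the covered span and that soft clips are ignored. The helper name mapped_positions is hypothetical and a plain list is returned instead of np.array.

```python
import re

REFERENCE_OPS = "MDN=X"  # CIGAR operations that advance along the reference

def mapped_positions(read):
    """Illustrative sketch: reference positions covered by one read."""
    position, cigar = read
    span = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REFERENCE_OPS
    )
    return list(range(position, position + span))

positions = mapped_positions((400, "3S15M1D9M2S"))
print(positions[0], positions[-1], len(positions))  # 400 424 25
```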
trxtools.SAMgeneral.groupCIGAR(cigar_string='')

Split CIGAR string to list of tuples

Parameters

cigar_string (str) –

Returns

list of tuples [( ),()]

Return type

list

>>> groupCIGAR("3S44M1S1H")
[('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]
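
The split shown above can be reproduced with a single regular expression. This is a sketch under the assumption that the CIGAR string uses only the standard SAM operation letters. The name split_cigar is hypothetical and is not part of trxtools.

```python
import re

def split_cigar(cigar_string):
    """Illustrative sketch: split a CIGAR string into (length, operation) tuples."""
    return re.findall(r"(\d+)([MIDNSHP=X])", cigar_string)

print(split_cigar("3S44M1S1H"))  # [('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]
```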
trxtools.SAMgeneral.noncoded2profile(df_input=Empty DataFrame Columns: [] Index: [], df_details=Empty DataFrame Columns: [] Index: [])

Turns non-coded ends into profile

Parameters
  • df_input (DataFrame) – output of parseNoncoded function

  • df_details (DataFrame) – chromosome lengths

Returns

DataFrame with profiles

Return type

DataFrame

trxtools.SAMgeneral.noncoded2profile1(df=Empty DataFrame Columns: [] Index: [], length=0)

Turns non-coded ends into profile

Parameters
  • df (DataFrame) – output of parseNoncodedList function

  • length (int) – chromosome length

Returns

Series with profiles

Return type

Series

trxtools.SAMgeneral.parseNoncoded(d={}, minLen=3)

Parse dict with non-coded ends and returns structured DataFrame

Parameters
  • d (dict) – dictionary with list of tuples for each chromosome {"chrI" : [], "chrII" : []}, defaults to dict()

  • minLen (int, optional) – minimal length for non-coded end to keep, defaults to 3

Returns

DataFrame with parsed non-coded ends

Return type

DataFrame

>>> parseNoncoded({"chrI":[(40, 'AAA'), (35, 'AACAA')]})
   index  AAA  AACAA   chr
0     40  1.0    NaN  chrI
1     35  NaN    1.0  chrI
trxtools.SAMgeneral.parseNoncodedList(l=[], minLen=3)

Parse list with non-coded ends and returns structured DataFrame

Parameters
  • l (list) – list of tuples [(int,str)], defaults to list()

  • minLen (int, optional) – minimal length for non-coded end to keep, defaults to 3

Returns

DataFrame with parsed non-coded ends

Return type

DataFrame

>>> parseNoncodedList([(40, 'AAA'), (35, 'AACAA')])
    AAA     AACAA
35  NaN     1.0
40  1.0     NaN
trxtools.SAMgeneral.saveBigWig(paths={}, suffix='', bw_name='', chroms=[])

Save gzip pickle data to BigWig

Parameters
  • paths (dict, optional) – defaults to dict()

  • suffix (str, optional) – defaults to str()

  • bw_name (str, optional) – defaults to str()

  • chroms (list, optional) – defaults to list()

trxtools.SAMgeneral.selectEnds(df=Empty DataFrame Columns: [] Index: [], ends='polyA')

Wrapper for functions selecting non-coded ends

Parameters
  • df (DataFrame) – output of parseNoncoded or parseNoncodedList

  • ends (str) – type of ends, currently only “polyA” is available, defaults to “polyA”

Returns

output of selectPolyA

Return type

DataFrame

trxtools.SAMgeneral.selectPolyA(df=Empty DataFrame Columns: [] Index: [])

Selects only polyA non-coded ends containing "AAA" and with “A”-content above 75%

Parameters

df (DataFrame) – output of parseNoncoded or parseNoncodedList

Returns

modified DataFrame

Return type

DataFrame
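
The selection rule described for selectPolyA can be written as a predicate on a single sequence. This is a hedged sketch only: the library function works on a DataFrame, while the hypothetical helper looks_like_polyA below checks one non-coded end at a time and assumes the 75% threshold is strict ("above").

```python
def looks_like_polyA(tail, min_fraction=0.75):
    """Illustrative sketch: does a non-coded end pass the polyA rule?"""
    if not tail:
        return False
    # Rule from the description: contains "AAA" and A-content above 75%
    return "AAA" in tail and tail.count("A") / len(tail) > min_fraction

print(looks_like_polyA("AAAA"))   # True
print(looks_like_polyA("AACAA"))  # False, no "AAA" stretch
print(looks_like_polyA("AAAC"))   # False, A-content is exactly 75%
```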

trxtools.SAMgeneral.selectSortPaths(paths={}, chroms=[], suffix='')

pyBigWig requires chromosomes in the input to be sorted

Parameters
  • paths (dict, optional) – defaults to {}

  • chroms (list, optional) – defaults to []

  • suffix (str, optional) – defaults to “”

trxtools.SAMgeneral.stripCIGAR(match=[], to_strip='H')

Removes the given CIGAR mark (“H” by default) from the output of groupCIGAR

Parameters
  • match (list) – output of groupCIGAR, defaults to []

  • to_strip (str, optional) – CIGAR mark to be stripped, defaults to “H”

Returns

modified list of tuples

Return type

list

trxtools.SAMgeneral.stripSubstitutions(match)

Strips substitutions on both ends

Parameters

match (list) – output of groupCIGAR

Returns

list of tuples

Return type

list

trxtools.SAMgeneral.tostripCIGARfive(match=[])

Calculates length of soft-clipped nucleotides at the 5’ end of the read

Parameters

match (list) – output of groupCIGAR, defaults to []

Returns

number of substituted nucleotides

Return type

int

>>> tostripCIGARfive([('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')])
3
trxtools.SAMgeneral.tostripCIGARthree(match=[])

Nucleotides without alignment at the 3’ end of the read

Parameters

match (list) – output of groupCIGAR, defaults to []

Returns

number of substituted nucleotides

Return type

int

>>> tostripCIGARthree([('3', 'S'), ('44', 'M'), ('1', 'S')])
1
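
Both tostripCIGAR examples above are consistent with reading the soft clip ("S") at the corresponding end of the grouped CIGAR. The sketch below assumes exactly that, and assumes hard clips have already been removed where needed, for example with stripCIGAR. The helper names are hypothetical.

```python
def clipped_five(match):
    """Illustrative sketch: soft-clipped length at the 5' end (0 if none)."""
    if match and match[0][1] == "S":
        return int(match[0][0])
    return 0

def clipped_three(match):
    """Illustrative sketch: soft-clipped length at the 3' end (0 if none)."""
    if match and match[-1][1] == "S":
        return int(match[-1][0])
    return 0

print(clipped_five([("3", "S"), ("44", "M"), ("1", "S"), ("1", "H")]))  # 3
print(clipped_three([("3", "S"), ("44", "M"), ("1", "S")]))             # 1
```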

trxtools.SAMgenome module

trxtools.SAMgenome.chromosome2profile3end(l=[], length=0, strand='FWD')

Generates profile for the 3’ ends of reads and saves position of non-coded end

Parameters
  • l (list) – list of triple tuples (position, cigar_string, sequence), defaults to []

  • length (int) – length of chromosome

  • strand (str, optional) – 'FWD' or 'REV', defaults to ‘FWD’

Returns

profile, noncoded

Return type

np.array, list of tuples

>>> chromosome2profile3end(l=[(10,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="FWD")
(0     0.0
1     0.0
...
34    0.0
35    1.0
36    0.0
...
50    0.0
dtype: float64,
[(35, 'CC')])
>>> chromosome2profile3end(l=[(40,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="REV")
(0     0.0
1     0.0
...
39    0.0
40    1.0
41    0.0
...
50    0.0
dtype: float64,
[(40, 'AAA')])
trxtools.SAMgenome.parseHeader(filename, name, dirPath)
trxtools.SAMgenome.reads2genome(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='read', logName='')

Function used by sam2genome. Works for both strands.

Parameters
  • name (str) – name of experiment

  • dirPath (str) –

  • df_details (DataFrame) – lengths of chromosomes

Returns

output_df_fwd, output_df_rev, log

Return type

DataFrame, DataFrame, list

trxtools.SAMgenome.reads2genome3end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='3end', noncoded=True, ends='polyA', logName='', minLen=3)

Function used by sam2genome3end. Works for both strands.

Parameters
  • name (str) – name of experiment

  • dirPath (str) –

  • df_details (DataFrame) – lengths of chromosomes

  • noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True

Returns

output_df_fwd, output_df_rev, log, noncoded_fwd, noncoded_rev

Return type

DataFrame, DataFrame, list, DataFrame, DataFrame

trxtools.SAMgenome.reads2genome5end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='5end', logName='')

Function used by sam2genome5end. Works for both strands.

Parameters
  • name (str) – name of experiment

  • dirPath (str) –

  • df_details (DataFrame) – lengths of chromosomes

Returns

output_df_fwd, output_df_rev, log

Return type

DataFrame, DataFrame, list

trxtools.SAMgenome.sam2genome(filename='', path='', toClear='', chunks=0, use='3end', noncoded=True, ends='polyA')

Function handling SAM files and generating profiles. Executed using wrapping script SAM2profilesGenomic.py.

Parameters
  • filename (str) –

  • path (str) –

  • toClear (str, optional) – element of filename to be removed, defaults to ‘’

  • chunks (int, optional) – Read SAM file in chunks, defaults to 0

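  • use (str, optional) – part of the read used for the profile (the helper functions in this module use ‘read’, ‘3end’ and ‘5end’), defaults to ‘3end’

  • noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True

  • ends (str, optional) – type of non-coded ends, currently only “polyA” is available, defaults to “polyA”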

trxtools.SAMgenome_old module

trxtools.SAMgenome_old.chromosome2profile3end(l=[], length=0, strand='FWD')

Generates profile for the 3’ ends of reads and saves position of non-coded end

Parameters
  • l (list) – list of triple tuples (position, cigar_string, sequence), defaults to []

  • length (int) – length of chromosome

  • strand (str, optional) – 'FWD' or 'REV', defaults to ‘FWD’

Returns

profile, noncoded

Return type

np.array, list of tuples

>>> chromosome2profile3end(l=[(10,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="FWD")
(0     0.0
1     0.0
...
34    0.0
35    1.0
36    0.0
...
50    0.0
dtype: float64,
[(35, 'CC')])
>>> chromosome2profile3end(l=[(40,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="REV")
(0     0.0
1     0.0
...
39    0.0
40    1.0
41    0.0
...
50    0.0
dtype: float64,
[(40, 'AAA')])
trxtools.SAMgenome_old.parseHeader(filename, name, dirPath)
trxtools.SAMgenome_old.reads2genome(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])

Function used by sam2genome. Works for both strands.

Parameters
  • name (str) – name of experiment

  • dirPath (str) –

  • df_details (DataFrame) – lengths of chromosomes

Returns

output_df_fwd, output_df_rev, log

Return type

DataFrame, DataFrame, list

trxtools.SAMgenome_old.reads2genome3end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], noncoded=True)

Function used by sam2genome3end. Works for both strands.

Parameters
  • name (str) – name of experiment

  • dirPath (str) –

  • df_details (DataFrame) – lengths of chromosomes

  • noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True

Returns

output_df_fwd, output_df_rev, log, noncoded_fwd, noncoded_rev

Return type

DataFrame, DataFrame, list, DataFrame, DataFrame

trxtools.SAMgenome_old.reads2genome5end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])

Function used by sam2genome5end. Works for both strands.

Parameters
  • name (str) – name of experiment

  • dirPath (str) –

  • df_details (DataFrame) – lengths of chromosomes

Returns

output_df_fwd, output_df_rev, log

Return type

DataFrame, DataFrame, list

trxtools.SAMgenome_old.sam2genome(filename='', path='', toClear='', pickle=False, chunks=0)

Function handling SAM files and generating profiles. Executed using wrapping script SAM2profilesGenomic.py.

Parameters
  • filename (str) –

  • path (str) –

  • toClear (str, optional) – element of filename to be removed, defaults to ‘’

  • pickle (bool, optional) – save output in pickle format, defaults to False

  • chunks (int, optional) – Read SAM file in chunks, defaults to 0

trxtools.SAMgenome_old.sam2genome3end(filename='', path='', toClear='', pickle=False, chunks=0, noncoded_pA=True, noncoded_raw=False)

Function handling SAM files and generating profiles for the 3’ end of reads. Executed using wrapping script SAM2profilesGenomic.py.

Parameters
  • filename (str) –

  • path (str) –

  • toClear (str, optional) – element of filename to be removed, defaults to ‘’

  • pickle (bool, optional) – save output in pickle format, defaults to False

  • chunks (int, optional) – Read SAM file in chunks, defaults to 0

  • noncoded_pA (bool, optional) – Save non-coded polyA ends, defaults to True

  • noncoded_raw (bool, optional) – Save all non-coded ends, defaults to False

trxtools.SAMgenome_old.sam2genome5end(filename='', path='', toClear='', pickle=False, chunks=0)

Function handling SAM files and generating profiles for the 5’ end of reads. Executed using wrapping script SAM2profilesGenomic.py.

Parameters
  • filename (str) –

  • path (str) –

  • toClear (str, optional) – element of filename to be removed, defaults to ‘’

  • pickle (bool, optional) – save output in pickle format, defaults to False

  • chunks (int, optional) – Read SAM file in chunks, defaults to 0

trxtools.SAMtranscripts module

trxtools.SAMtranscripts.reads2profile(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])

Takes a list of aligned reads and transforms it into a profile for each transcript. Tested only for PLUS strand.

trxtools.SAMtranscripts.reads2profileDeletions(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], expand=5)

Tested only for PLUS strand.

trxtools.SAMtranscripts.sam2profiles(filename='', path='', geneList=[], toClear='', df_details=Empty DataFrame Columns: [] Index: [], deletions=False, expand=5, pickle=False, chunks=0)

Function handling SAM files and generating profiles. Executed using wrapping script SAM2profiles.py.

Parameters
  • filename (str) –

  • path (str) –

  • geneList (list) – list of transcript to be selected

  • toClear (str, optional) – element of filename to be removed, defaults to ‘’

  • df_details (DataFrame) – Details of transcripts

  • deletions (bool, optional) – Generate profile of deletions, defaults to False

  • expand (int, optional) – Expand deletions, defaults to 5

  • pickle (bool, optional) – save output in pickle format, defaults to False

  • chunks (int, optional) – Read SAM file in chunks, defaults to 0

trxtools.SAMtranscripts.transcript2profile(l=[], length=0)

Takes list of tuples (position,CIGARstring) and generates profile. Works with entire reads, for both strands.

trxtools.SAMtranscripts.transcript2profileDeletions(l=[], expand=0, length=0)

Takes list of tuples (position,CIGARstring) and generates profile. Not tested for MINUS strand.

trxtools.assays module

trxtools.assays.bulkInputIDT(data=Empty DataFrame Columns: [] Index: [])

Transforms a DataFrame with columns “template” and “non-template” into a table to be used as bulk input. Discards oligos longer than 200 nt.

Parameters

data – DataFrame()

Returns

DataFrame with sequences to order

trxtools.assays.extruded(seq='', buried=13)
Parameters
  • seq – str()

  • buried – int() length of sequence buried within RNAP

Returns

Extruded sequence, str

trxtools.assays.findPrimer(seq='', query='')

Find query sequence within given sequence

Parameters
  • seq – str() containing sequence

  • query – str() sequence to search for

Returns

(start,stop)
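
A minimal sketch of a (start, stop) lookup of this kind. The coordinate convention is an assumption, since the documentation does not state it: here start is 0-based, stop is exclusive, and None is returned when the query is absent. The name find_query is hypothetical.

```python
def find_query(seq, query):
    """Illustrative sketch: locate query in seq as (start, stop), or None."""
    start = seq.find(query)
    if start == -1:
        return None
    # Assumption: 0-based start, exclusive stop, so seq[start:stop] == query
    return start, start + len(query)

print(find_query("GGATCCAAA", "ATCC"))  # (2, 6)
```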

trxtools.assays.nonTemplateDNA(seq='', primer='')
Parameters
  • seq – str() sequence

  • primer – str() sequence

Returns

DNA sequence of non-template strand, str

trxtools.assays.sequenceConstrain(structure='', stall='', RNAprimer='')

Returns sequence constrained with the 5prime RNA primer

Parameters
  • structure – str()

  • stall – str() single letter

  • RNAprimer – str()

Returns

str() with sequence constrains for folding algorithm

trxtools.assays.stalled(seq='', stall='AAA', primer='')

Finds and returns stalled sequence

Parameters
  • seq – str()

  • stall – str() default “AAA”

  • primer – str()

Returns

stalled sequence, str

trxtools.assays.structureFile(structure='', stall='', RNAprimer='')

Saves structure file in current directory

Parameters
  • structure – str()

  • stall – str()

  • RNAprimer – str()

Returns

True

trxtools.assays.templateDNA(seq='', overhang5end='')
Parameters
  • seq – str() sequence of RNA

  • overhang5end – str() sequence of the 5’ end DNA overhang

Returns

DNA sequence of template strand, str

trxtools.assays.testScaffold(data=Empty DataFrame Columns: [] Index: [], overhang5end='', RNA_primer='')

Prints scaffold to test before ordering

Parameters
  • data – DataFrame()

  • overhang5end – str() overhang to shift sequences

  • RNA_primer – str() sequence

Returns

trxtools.assays.toOrder(data=Empty DataFrame Columns: [] Index: [], buried='GUCUGUUUUGUGG', stallFull='AAA', afterStall='TGATCGGTAC', overhang5end='TGA', RNA_primer='AGGCCGAAA', bulkInput=True, test=True, lengthMax=200)

Transfers folding function to DNA sequences

Parameters
  • data – DataFrame()

  • buried – str() sequence

  • stallFull – str() sequence

  • afterStall – str() sequence

  • overhang5end – str() sequence

  • RNA_primer – str() sequence

  • bulkInput – boolean() default True

  • test – boolean() default True

  • lengthMax – int() default 200

Returns

DataFrame of sequences to order

trxtools.go_enrichment module

trxtools.go_enrichment.get_enrichment(query_genes, organism, use_reference_set=False, ref_genes=None, ref_organism=None, go_dataset='biological_process', test_type='FISHER', correction='FDR')

Run a GO term enrichment test using PANTHER API

Parameters
  • query_genes (list) – List of sequence identifiers of queried genes (e.g. transcript ids, gene ids)

  • organism (str) – Taxid of query species (e.g. “9606” for H. sapiens)

  • use_reference_set (bool) – Use a custom set of reference (background) genes? Default False. If True, ref_genes and ref_organism need to be specified.

  • ref_genes (list) – optional list of reference genes. Specifying None (default) will use the whole genome of species specified in organism. When passing a list, ref_organism taxid must also be provided.

  • ref_organism (str) – Taxid of reference species, required when ref_genes is not None

  • go_dataset (str) – Which annotation dataset to query, “biological_process” or “molecular_function”

  • test_type (str) – Which statistical test to use. Available: “FISHER” (default), “BINOMIAL”

  • correction (str) – Which multiple testing correction method to use. Available: “FDR” (default), “BONFERRONI”, “NONE”

Returns

Unfiltered DataFrame of results.

Return type

pandas.DataFrame

trxtools.methods module

trxtools.methods.addCluster(df=Empty DataFrame Columns: [] Index: [], n=10)

Assigns n clusters to the data using KMeans algorithm

Parameters
  • df – DataFrame

  • n – int, no. of clusters

Returns

trxtools.methods.bashCommand(bashCommand='')

Run command in bash using subprocess.call()

trxtools.methods.calGC(dataset=Empty DataFrame Columns: [] Index: [], calFor=['G', 'C'])

Returns GC content in a given string - uses [‘nucleotide’] column

Parameters

dataset – DataFrame() with “nucleotide” column

Returns

fraction of GC content, float
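
The GC fraction itself is a simple ratio. The sketch below works on any iterable of single-letter nucleotides, such as a string or the values of a “nucleotide” column. The name gc_fraction and the choice to return 0.0 for empty input are assumptions, not the behaviour of calGC.

```python
def gc_fraction(nucleotides, cal_for=("G", "C")):
    """Illustrative sketch: fraction of positions whose letter is in cal_for."""
    nucleotides = list(nucleotides)
    if not nucleotides:
        return 0.0  # assumption: empty input gives 0.0 rather than an error
    return sum(n in cal_for for n in nucleotides) / len(nucleotides)

print(gc_fraction("GGCA"))  # 0.75
```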

trxtools.methods.cleanNames(df=Empty DataFrame Columns: [] Index: [], additional_tags=[])

Cleans some problems with names if exist

Parameters
  • df – DataFrame() where names of columns are name of experiments

  • additional_tags – list()

Returns

DataFrame() with new names

trxtools.methods.define_experiments(paths_in, whole_name=False, strip='_hittable_reads.txt')

Parse file names and extract experiment name from them

Parameters
  • paths_in – str()

  • whole_name – boolean() default False. By default the script takes only the first ‘a_b_c’ part of the name

  • strip – str() to strip from filename.

Returns

list() of experiment names, list() of paths.

trxtools.methods.expNameParser(name, additional_tags=[], order='b_d_e_p')

Function handles experiment name; recognizes AB123456 as experiment date; BY4741 or HTP or given string as bait protein

Parameters
  • name

  • additional_tags – list of tags

  • output – default ‘root’ ; print other elements when ‘all’

  • order – default ‘b_d_e_p’ b-bait; d-details, e-experiment, p-prefix

Returns

list of reordered name

trxtools.methods.expStats(input_df=Empty DataFrame Columns: [] Index: [], smooth=True, window=10, win_type='blackman')

Returns DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments

Parameters
  • input_df – DataFrame

  • smooth – boolean, if True apply smoothing window, default=True

  • window – int, smoothing window, default 10

  • win_type – str type of smoothing window, default “blackman”

Returns

DataFrame

trxtools.methods.filterExp(datasets, let_in=[''], let_out=['wont_find_this_string'], verbose=False)

Returns object with filtered columns/keys.

Parameters
  • datasets – DataFrame() or dict() with exp name as a key

  • let_in – list() with elements of name to filter in

  • let_out – list() with elements of name to filter out

Returns

DataFrame() or dict()

trxtools.methods.groupCRACsamples(df=<class 'pandas.core.frame.DataFrame'>, use='protein', toDrop=[])

Parses CRAC names and annotates them using one of the following features [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’,’sampleRep’]

Parameters
  • df – DataFrame

  • use – str, choose from [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’,’sampleRep’], default = ‘protein’

  • toDrop – list of words in the CRAC name that qualify the sample for rejection, default = []

Returns

DataFrame with added column [‘group’]

trxtools.methods.indexOrder(df=Empty DataFrame Columns: [] Index: [], additional_tags=[], output='root', order='b_d_e_p')

Apply expNameParser to whole DataFrame

Parameters
  • df – DataFrame() where names of columns are name of experiments

  • additional_tags – list()

  • output

  • order – str() default ‘b_d_e_p’ b-bait; d-details, e-experiment, p-prefix

Returns

DataFrame() with new names

trxtools.methods.letterContent(s='', letter='A')
trxtools.methods.list_paths_in_current_dir(suffix='', stdin=False)
Parameters
  • suffix – str() lists paths in current directory ending with an indicated suffix only

  • stdin – boolean() if True read from standard input instead of the current directory

Returns

list() of paths

trxtools.methods.normalize(df=<class 'pandas.core.frame.DataFrame'>, log2=False, pseudocounts=0.1)
Parameters
  • df – DataFrame

  • log2 – boolean, default=False

  • pseudocounts – float, default=0.1

Returns

trxtools.methods.parseCRACname(s1=<class 'pandas.core.series.Series'>)
Parse CRAC name into [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’] using this order.

“_” is used to split the name

Parameters

s1 – Series,

Returns

DataFrame
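
The naming convention can be illustrated on a single name. In this sketch the example name is made up, the helper parse_crac_name is hypothetical, and filling missing trailing fields with None is an assumption. parseCRACname itself takes a Series and returns a DataFrame.

```python
FIELDS = ["expID", "expDate", "protein", "condition1", "condition2", "condition3"]

def parse_crac_name(name):
    """Illustrative sketch: split one CRAC name on "_" into the ordered fields."""
    parts = name.split("_")
    # Assumption: fields missing from a short name are reported as None
    parts = parts + [None] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

parsed = parse_crac_name("C1_200101_Rpa190HTP_wt")  # made-up example name
print(parsed["protein"], parsed["condition1"])      # Rpa190HTP wt
```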

trxtools.methods.quantileCategory(s1=Series([], dtype: float64), q=4)

Quantile-based discretization function based on pandas.qcut function.

Parameters
  • s1 – Series()

  • q – int() number of quantiles: 10 for deciles, 5 for quintiles, 4 for quartiles, etc., default q=4

Returns

Series

trxtools.methods.randomDNAall(length=0, letters='CGTA')

Generates all possible random sequences of a given length

Parameters
  • length – int()

  • letters – str() with letters that will be used

Returns

list() of str()
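
Enumerating every sequence of a given length is a Cartesian product over the alphabet. The sketch below uses itertools.product. The name all_sequences is hypothetical and the ordering of the output is an assumption that may differ from randomDNAall.

```python
from itertools import product

def all_sequences(length, letters="CGTA"):
    """Illustrative sketch: every sequence of the given length over letters."""
    return ["".join(combo) for combo in product(letters, repeat=length)]

print(all_sequences(1))       # ['C', 'G', 'T', 'A']
print(len(all_sequences(3)))  # 64
```

The number of sequences grows as len(letters) ** length, so this is practical only for short lengths.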

trxtools.methods.randomDNAsingle(length=0, letters='CGTA')

Random generator of nucleotide sequence

Parameters
  • length – int()

  • letters – str() with letters that will be used

Returns

str()

trxtools.methods.readSalmon(nameElem='', path='', toLoad='', toClear=[], toAdd='', column='NumReads', df=None, overwrite=False)
Parameters
  • nameElem – str, elem to load

  • path – str

  • toLoad – str, additional param for filtering, by default equal to nameElem

  • toClear – str

  • toAdd – str

  • df – pd.DataFrame

  • overwrite – boolean, default=False

Returns

trxtools.methods.read_HTSeq_output(path='', toLoad='classes', toClear=[], toAdd='', df=None, overwrite=False)

Reads multiple HTSeq tab files to one DataFrame

Parameters
  • path – str, path to directory with files

  • toClear – str, will be removed from file name

  • toAdd – str, to be added to file name

  • df – DataFrame, to be appended; default=None

  • overwrite – boolean, allows for overwriting during appending, default = False

Returns

DataFrame

trxtools.methods.read_STARstats(path='', toClear=[], toAdd='', df=None, overwrite=False)

Reads multiple STAR statistics files to one DataFrame

Parameters
  • path – str, path to directory with files

  • toClear – str, will be removed from file name

  • toAdd – str, to be added to file name

  • df – DataFrame, to be appended; default=None

  • overwrite – boolean, allows for overwriting during appending, default = False

Returns

DataFrame

trxtools.methods.read_featureCount(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)

Read tab files with common first column

Parameters
  • nameElem – str, present in all files

  • path – str, path to directory with files

  • toLoad – str, to be present in file name (optional)

  • toClear – str, will be removed from file name

  • toAdd – str, to be added to file name

  • df – DataFrame, to be appended; default=None

  • overwrite – boolean, allows for overwriting during appending, default = False

Returns

DataFrame

trxtools.methods.read_list(filepath='')

Read list from file. Each row becomes item in the list.

Parameters

filepath – str

Returns

list

trxtools.methods.read_tabFile(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)

Read tab files with common first column

Parameters
  • nameElem – str, present in all files

  • path – str, path to directory with files

  • toLoad – str, to be present in file name

  • toClear – str, will be removed from file name

  • toAdd – str, to be added to file name

  • df – DataFrame, to be appended; default=None

  • overwrite – boolean, allows for overwriting during appending, default = False

Returns

DataFrame

trxtools.methods.reverse_complement(seq)

Reverse complement

Parameters

seq – str

Returns

str

trxtools.methods.reverse_complement_DNA(seq)

Reverse complement of a DNA sequence

Parameters

seq – str

Returns

str

trxtools.methods.reverse_complement_RNA(seq)

Reverse complement of an RNA sequence

Parameters

seq – str

Returns

str
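
A reverse complement can be sketched with a translation table followed by a reversal. The helper names below are hypothetical, and the sketch assumes upper-case input containing only the four standard letters. How the trxtools functions treat other characters is not documented here.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def revcomp_dna(seq):
    """Illustrative sketch: complement each base, then reverse the string."""
    return seq.translate(DNA_COMPLEMENT)[::-1]

def revcomp_rna(seq):
    """Illustrative sketch: same as above with U in place of T."""
    return seq.translate(RNA_COMPLEMENT)[::-1]

print(revcomp_dna("AACG"))  # CGTT
print(revcomp_rna("AACG"))  # CGUU
```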

trxtools.methods.rollingGC(s=<class 'pandas.core.series.Series'>, window=10)

Calculates GC from sequence, uses ‘boxcar’ window

Parameters
  • s – Series containing sequence

  • window – window size for GC calculation

Returns

Series with GC calculated, center=False
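
A rolling GC fraction with a flat ('boxcar') window and center=False can be sketched without pandas. The assumptions here are that the window trails each position and that the first window - 1 positions are undefined (None), mirroring the usual rolling-window behaviour. The name rolling_gc is hypothetical.

```python
def rolling_gc(sequence, window=10):
    """Illustrative sketch: GC fraction in a trailing, equally weighted window."""
    values = []
    for end in range(1, len(sequence) + 1):
        if end < window:
            values.append(None)  # not enough bases yet to fill the window
        else:
            chunk = sequence[end - window:end]
            values.append(sum(base in "GC" for base in chunk) / window)
    return values

print(rolling_gc("GGAT", window=2))  # [None, 1.0, 0.5, 0.0]
```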

trxtools.methods.runPCA(data=Empty DataFrame Columns: [] Index: [], n_components=2)

Run PCA analysis and re-assigns column names and index names

Parameters
  • data – DataFrame

  • n_components – int, default 2

Returns

tuple consisting of DataFrame with PCA results and a list of PC values

Return type

tuple

trxtools.methods.timestamp()
Returns

timestamp as a str()

trxtools.methods.timestampRandomInt()
Returns

timestamp and random number as a str()

trxtools.nascent module

class trxtools.nascent.Fold(tempDir=None)

Bases: object

RNAfold(data, saveData=False, temp=None)

Calculates dG using RNAfold (ViennaRNA)

Parameters
  • data – input data {list, Series, DataFrame with “name” column}

  • saveData – boolean, default False

  • temp – int, default None

Returns

DataFrame

RNAinvert(structure='', saveData=False, temp=None, n=5, RNAprimer='', stall='', quick=False)

Returns n sequences with a given structure

Parameters
  • structure – str with secondary RNA structure

  • saveData – boolean, default False

  • temp – int, default None

  • n – int number of output sequences, default 5

  • RNAprimer – str sequence

  • stall – str nucleotide

  • quick – boolean if False uses -Fmp -f 0.01 params, default False

Returns

DataFrame

RNAinvertStall(structure='', RNAprimer='', stall='A', n=200)

Returns sequences without nt that is present in stall

Parameters
  • structure – str with secondary RNA structure

  • RNAprimer – str sequence

  • stall – str sequence, default “A”

  • n – int, default 200

Returns

DataFrame

UNAfold(data, saveData=False, temp=None)

Calculates dG using UNAfold

Parameters
  • data – input data {list, Series, DataFrame with “name” column}

  • saveData – boolean, default False

  • temp – int, default None

Returns

DataFrame

bashFolding(method='RNA')

Runs RNA folding using bash

Parameters

method – “RNA” for ViennaRNA or “UNA” for UNAfold

Returns

class trxtools.nascent.Hybrid(tempDir=None)

Bases: object

RNAhybrid(data, saveData=False, temp=None)

Calculates dG using hybrid-min

Parameters
  • data – input data {list, Series, DataFrame with “name” column}

  • saveData – boolean, default False

  • temp – int, default None

Returns

DataFrame

bashHybrid()

Runs hybrid-min using bash

Returns

trxtools.nascent.analyseViennamarkGC(vienna='', sequence='')

Leaves only C and G in stem structures

Parameters
  • vienna – str with vienna format

  • sequence – str sequence

Returns

str

trxtools.nascent.extendingWindow(sequence='', name='name', strand='plus', temp=30, m=7)

Returns DataFrame of sequences of all possible lengths between minimum (m) and length of input sequence -1

Parameters
  • sequence – str

  • name – str, default “name”

  • strand – str {“plus”,”minus”,”both”}, default “plus” (not tested for others)

  • temp – int, default 30

  • m – int, default 7 - RNAfold does not return any values for length shorter than 7 even at 4 deg C

Returns

DataFrame of sequences with names according to nascent.slidingWindow convention
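
The extending window can be pictured as a series of growing sequences. The sketch below assumes they are prefixes anchored at the 5’ end, as a nascent transcript would grow, with lengths from m up to the input length minus 1. The name extending_prefixes is hypothetical, and the naming and DataFrame layout of the real output are not reproduced.

```python
def extending_prefixes(sequence, m=7):
    """Illustrative sketch: prefixes of length m .. len(sequence) - 1."""
    # Assumption: windows are anchored at the 5' end and grow toward the 3' end
    return [sequence[:n] for n in range(m, len(sequence))]

print(extending_prefixes("ACGUACGUAC", m=7))  # ['ACGUACG', 'ACGUACGU', 'ACGUACGUA']
```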

trxtools.nascent.foldNascentElem(data=Empty DataFrame Columns: [] Index: [])

Fold the very 3’ of nascent elements.

Parameters

data – DataFrame

Returns

DataFrame

trxtools.nascent.handleInput(data, keepNames=True)

Input data with columns: “seq” or “sequence” and “name” (optional)

Parameters
  • data – {str, list, Series, DataFrame}

  • keepNames – boolean, if True use given names, default True

Returns

DataFrame where index becomes name of sequence to fold

trxtools.nascent.join2d(df=Empty DataFrame Columns: [] Index: [], use='format')
Parameters
  • df – DataFrame

  • use – str, default “format”

Returns

DataFrame

trxtools.nascent.markVienna(df=Empty DataFrame Columns: [] Index: [])

Apply analyseViennamarkGC

Parameters

df – DataFrame

Returns

DataFrame

trxtools.nascent.merge2d(df=<class 'pandas.core.frame.DataFrame'>)
Parameters

df – DataFrame

Returns

DataFrame

trxtools.nascent.name2index(s1=Series([], dtype: object))

Extracts position from sequence name

Parameters

s1 – Series with names from prepareNascent function

Returns

Series with positions

trxtools.nascent.nascentElems(vienna='', sequence='')

Describe elements of secondary structure: stems, multistems.

Parameters
  • vienna – str

  • sequence – str

Returns

str

trxtools.nascent.nascentElemsDataFrame(data=Empty DataFrame Columns: [] Index: [])

Apply nascentElems to DataFrame of folded sequences

Parameters

data – DataFrame containing df[‘vienna’] and df[‘sequence’]

Returns

DataFrame

trxtools.nascent.nascentFolding(sequence='', temp=30, window=100)

Combines folding function: fold RNA, locate last nascent element and calculate dG of it.

Parameters
  • sequence – str

  • temp – int, default 30

  • window – int, default 100

Returns

DataFrame

trxtools.nascent.parseFoldingName(df=Empty DataFrame Columns: [] Index: [])
trxtools.nascent.prepareNascent(sequence='', name='name', strand='plus', temp=30, window=100)

Divide long transcript into short sequences. Combines output of extendingWindow and slidingWindow.

Parameters
  • sequence – str

  • name – str, default “name”

  • strand – str {plus,minus,both}, default “plus” (not tested for others)

  • temp – int, default 30

  • window – int, default 100

Returns

Dataframe with sequences

trxtools.nascent.selectFoldedN(data=Empty DataFrame Columns: [] Index: [], n=5, pattern='(((((....)))))')

Takes the Fold().RNAfold() DataFrame as input. Selects n rows with a given pattern at the 5’ end and the most different folding energy

Parameters
  • data – DataFrame

  • n – int samples, default 5

  • pattern – str in Vienna format, default “(((((....)))))”

Returns

DataFrame

trxtools.nascent.slidingWindow(sequence='', name='name', strand='plus', temp=30, window=100)

Slices sequence using sliding window

Parameters
  • sequence – str

  • name – str, default “name”

  • strand – str {“plus”,“both”,“minus”}, default “plus”

  • temp – int, default 30

  • window – int, default 100

Returns

DataFrame with sliding windows
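
The sliding-window slicing described above can be illustrated with a minimal pure-Python sketch. It is not the trxtools implementation: the helper name `sliding_windows`, the returned list of tuples (instead of a DataFrame) and the label format are assumptions made for illustration only.

```python
def sliding_windows(sequence, window=100, name="name"):
    """Return (label, subsequence) pairs for every full-length window."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    # The range is empty when the sequence is shorter than the window.
    for start in range(len(sequence) - window + 1):
        end = start + window
        # Hypothetical label: name plus the 1-based position of the
        # last nucleotide in the window.
        out.append((f"{name}_pos{end}", sequence[start:end]))
    return out


windows = sliding_windows("ACGUACGUAC", window=4, name="demo")
print(windows[0])   # first window
print(windows[-1])  # last window
```

A 10-nt sequence with a 4-nt window yields 7 overlapping windows, each shifted by one nucleotide.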

trxtools.plotting module

trxtools.plotting.clusterClusterMap(df)

Clustermap for clusters

Parameters

df – DataFrame

Returns

trxtools.plotting.hplotSTARstats(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.hplotSTARstats_chimeric(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.hplotSTARstats_mapping(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.hplotSTARstats_mistmatches(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.hplotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.hplotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plotCumulativePeaks(ref, df2=Empty DataFrame Columns: [] Index: [], local_pos=[], dpi=150, title='', start=None, stop=None, window=50, figsize=(4, 3), color1='green', color2='magenta', lc='red')

Plot single gene peaks metaplot.

Parameters
  • ref – str with path to csv file or DataFrame

  • df2 – DataFrame

  • local_pos – list of features (peaks/troughs)

  • dpi – int, default 150

  • title – str

  • start – int

  • stop – int

  • window – int, default 50

  • figsize – tuple, default (4,3)

  • color1 – str, default “green”

  • color2 – str, default “magenta”

  • lc – str, default “red”

Returns

trxtools.plotting.plotPCA(data=Empty DataFrame Columns: [] Index: [], names=[], title='', PClimit=1, figsize=(7, 7), PCval=[])

Plot PCA plot

Parameters
  • data – DataFrame

  • names – list of names to annotate

  • title – str

  • PClimit – int number of PC to plot, default 1

  • figsize – tuple, default (7,7)

Returns

>>> plotPCA(methods.runPCA(example_df)[0])
trxtools.plotting.plotSTARstats(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plotSTARstats_chimeric(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plotSTARstats_mapping(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plotSTARstats_mistmatches(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Parameters
  • df – DataFrame

  • dpi – int, default=150

Returns

trxtools.plotting.plot_as_box_plot(df=Empty DataFrame Columns: [] Index: [], title='', start=None, stop=None, figsize=(7, 3), ylim=(None, 0.01), dpi=150, color='green', h_lines=[], lc='red', offset=0)

Plots a figure similar to a box plot: median, 2nd and 3rd quartiles and min-max range

Parameters
  • df – DataFrame containing the following columns: `['position'] ['mean'] ['median'] ['std']`, optionally `['nucleotide'] ['q1'] ['q3'] ['max'] ['min']`

  • title – str

  • start – int

  • stop – int

  • figsize – tuple, default (7,3)

  • ylim – tuple, OY axis limits, default (None,0.01)

  • dpi – int, default 150

  • color – str, default “green”

  • h_lines – list of horizontal lines

  • lc – str color of horizontal lines, default “red”

  • offset – int number to offset position if 5’ flank was used, default 0

Returns

trxtools.plotting.plot_diff(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm', label='', start=None, stop=None, plot_medians=True, plot_ranges=True, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], offset=0)

Plot given dataset and reference, differences are marked

Parameters
  • ref – str with path to csv file or DataFrame

  • dataset – DataFrame containing the following columns: `['position'] ['mean'] ['median'] ['std']`, optionally `['nucleotide'] ['q1'] ['q3'] ['max'] ['min']`

  • ranges – str “mm” : min-max or “qq” : q1-q3

  • label – str

  • start – int

  • stop – int

  • plot_medians – boolean if True plot medians, default True

  • plot_ranges – boolean if True plot ranges, default True

  • figsize – tuple, default (7,3)

  • ylim – tuple, OY axis limits, default (None,0.01)

  • h_lines – list of horizontal lines

Returns

trxtools.plotting.plot_heatmap(df=Empty DataFrame Columns: [] Index: [], title='Heatmap of differences between dataset and reference plot for RDN37-1', vmin=None, vmax=None, figsize=(20, 10))

Plot heat map of differences, from dataframe generated by compare1toRef(dataset, heatmap=True) function

Parameters
  • df – DataFrame

  • title – str

  • vmin

  • vmax

  • figsize – tuple, default (20,10)

Returns

trxtools.plotting.plot_to_compare(ref, df=Empty DataFrame Columns: [] Index: [], color1='green', color2='black', ref_label='', label='', title='', start=None, stop=None, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], lc='red', dpi=150, offset=300)

Figure to compare two plots similar to a box plot: median, 2nd and 3rd quartiles and min-max range

Parameters
  • ref – str with path to csv file or DataFrame

  • df – DataFrame

  • color1 – str, default “green”

  • color2 – str, default “black”

  • ref_label – str

  • label – str

  • title – str

  • start – int

  • stop – int

  • figsize – tuple, default (7,3)

  • ylim – tuple, OY axis limits, default (None,0.01)

  • h_lines – list of horizontal lines

  • lc – str color of horizontal lines, default “red”

  • dpi – int, default 150

  • offset – int number to offset position if 5’ flank was used, default 300

Returns

trxtools.profiles module

trxtools.profiles.FoldingFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, ranges=0, offset=15, fold='dG65nt@30C')

Pulls folding information from BigWig folding data for a given gene.

Parameters
  • gene_name – str

  • gtf – pyCRAC.GTF2 object with GTF and TAB files loaded

  • bwFWD – dict of pyBigWig objects

  • bwREV – dict of pyBigWig objects

  • ranges – int flanks to be added for the gene, default 0

  • offset – int to offset folding data, default 15

  • fold – str name of output column, default “dG65nt@30C”

Returns

DataFrame

trxtools.profiles.calculateFDR(data=Series([], dtype: float64), iterations=100, target_FDR=0.05)

Calculates False Discovery Rate (FDR) for a given dataset.

Parameters
  • data – Series

  • iterations – int, default 100

  • target_FDR – float, default 0.05

Returns

Series

trxtools.profiles.compare1toRef(ref, dataset=Series([], dtype: float64), ranges='mm', heatmap=False, relative=False)

Takes a Series and compares it with the reference DataFrame

Parameters
  • ref – str with path to csv file or DataFrame

  • dataset – Series

  • ranges – mm : min-max or qq : q1-q3

  • heatmap – boolean; if False, returns a DataFrame with columns rae_min, rae_max, ear_min, ear_max (reference_above_experiment minimum etc.); if True, returns a Series of differences for plotting a heatmap

  • relative – boolean, only for heatmap, recalculates differences according to the peak size. Warning: negative values are in range -1 to 0 but positive are from 0 to values higher than 1

Returns

DataFrame (heatmap=False) or Series (heatmap=True)

trxtools.profiles.compareMoretoRef(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm')

Takes a DataFrame created by filter_df and compares it with the reference DataFrame

Parameters
  • ref – str with path to csv file or DataFrame

  • dataset – DataFrame

  • ranges – mm : min-max or qq : q1-q3

Returns

DataFrame

trxtools.profiles.dictBigWig(files=[], path='', strands=True)

Preloads BigWig files to memory using pyBigWig tools

Parameters
  • files – list of files

  • path – str

  • strands – boolean, default True

Returns

dict or dict, dict

trxtools.profiles.findPeaks(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)

Find local extrema using SciPy argrelextrema function

Parameters
  • s1 – Series data to localize peaks

  • window – int, To smooth data before peak-calling. default 1 (no smoothing)

  • win_type – str type of smoothing window, default “blackman”

  • order – int minimal spacing between peaks, argrelextrema order parameter, default 20

Returns

list of peaks
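
To show how the `order` parameter controls peak spacing, here is a minimal pure-Python sketch. It deliberately swaps SciPy's `argrelextrema` for a hand-written loop so that it runs without dependencies, and it skips the smoothing step. The helper name `local_maxima` is hypothetical, and the edge handling (indices clipped to the ends of the data) is an assumption modelled on `argrelextrema`'s default `mode='clip'`.

```python
def local_maxima(values, order=1):
    """Indices whose value is strictly greater than all neighbours
    within `order` positions on each side."""
    if order < 1:
        raise ValueError("order must be >= 1")
    n = len(values)
    peaks = []
    for i in range(n):
        is_peak = True
        for k in range(1, order + 1):
            left = max(i - k, 0)       # clip at the left edge
            right = min(i + k, n - 1)  # clip at the right edge
            if not (values[i] > values[left] and values[i] > values[right]):
                is_peak = False
                break
        if is_peak:
            peaks.append(i)
    return peaks


signal = [0, 2, 1, 3, 7, 3, 1, 4, 0]
print(local_maxima(signal, order=1))
print(local_maxima(signal, order=3))
```

Raising `order` keeps only peaks that dominate a wider neighbourhood, which is why it acts as a minimal spacing between reported peaks.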

trxtools.profiles.findTroughs(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)

Find local minima using SciPy argrelextrema function

Parameters
  • s1 – Series data to localize troughs

  • window – int, To smooth data before trough-calling. default 1 (no smoothing)

  • win_type – str type of smoothing window, default “blackman”

  • order – int minimal spacing between min, argrelextrema order parameter, default 20

Returns

list of troughs

trxtools.profiles.geneFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, toStrip='', ranges=0)

Pulls genome coverage from BigWig data for a given gene. One BigWig file -> one column.

Parameters
  • gene_name – str

  • gtf – pyCRAC.GTF2 object with GTF and TAB files loaded

  • bwFWD – dict of pyBigWig objects

  • bwREV – dict of pyBigWig objects

  • toStrip – str of name to be stripped

  • ranges – int flanks to be added for the gene, default 0

Returns

DataFrame

trxtools.profiles.ntotal(df=<class 'pandas.core.frame.DataFrame'>, drop=True)

Normalize data in DataFrame to fraction of total column

Parameters
  • df – DataFrame

  • drop – boolean, if True drop ‘position’ and ‘nucleotide’ columns, default True

Returns

DataFrame
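
Normalization to a fraction of the column total can be sketched as follows. This is only an illustration using plain dicts of lists in place of a pandas DataFrame; the helper name `fraction_of_total` and the treatment of all-zero columns (returned as zeros) are assumptions, not documented trxtools behaviour.

```python
def fraction_of_total(columns, drop=("position", "nucleotide")):
    """Divide every value by the sum of its column, skipping dropped columns."""
    out = {}
    for name, values in columns.items():
        if name in drop:
            continue  # mimic drop=True: annotation columns are left out
        total = sum(values)
        if total:
            out[name] = [v / total for v in values]
        else:
            # Assumption: an all-zero column stays all zeros.
            out[name] = [0.0] * len(values)
    return out


table = {
    "position": [1, 2, 3, 4],
    "exp1": [1, 1, 2, 4],
    "exp2": [0, 5, 5, 10],
}
normalized = fraction_of_total(table)
print(normalized["exp1"])
```

After normalization each experimental column sums to 1, so profiles with different sequencing depths become comparable.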

trxtools.profiles.parseConcatFile(path, gtf, use='reads', RPM=False, ranges=1000)

Parse concat file

Parameters
  • path – str with path of the concat file

  • gtf – pyCRAC.GTF2 object with GTF and TAB files loaded

  • use – str with name of column to use [‘reads’, ‘substitutions’, ‘deletions’], default “reads”

  • RPM – boolean, default False

  • ranges – int flanks to be added for the gene, default 1000

Returns

dict of DataFrames; using gene name as a key

trxtools.profiles.preprocess(input_df=Empty DataFrame Columns: [] Index: [], let_in=[''], let_out=['wont_find_this_string'], stats=False, smooth=True, window=10, win_type='blackman')

Combines methods.filterExp and expStats. Returns a DataFrame with the chosen experiments; optionally applies smoothing and stats

Parameters
  • input_df – DataFrame

  • let_in – list of words that characterize experiment, default [‘’]

  • let_out – list of words that disqualify experiments, default [‘wont_find_this_string’]

  • stats – boolean, if True return stats for all experiments, default False

  • smooth – boolean, if True apply smoothing window, default True

  • window – int smoothing window, default 10

  • win_type – str type of smoothing window, default “blackman”

Returns

DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments

trxtools.profiles.pseudocounts(df=<class 'pandas.core.frame.DataFrame'>, value=0.01, drop=True)

Add pseudocounts to data

Parameters
  • df – DataFrame

  • value – float, default 0.01

  • drop – boolean, if True drop ‘position’ and ‘nucleotide’ columns, default True

Returns

DataFrame

trxtools.profiles.save_csv(data_ref=Empty DataFrame Columns: [] Index: [], datasets=Empty DataFrame Columns: [] Index: [], path=None)

Saves DataFrame to csv

Parameters
  • data_ref – DataFrame with ['position'] and ['nucleotide'] columns

  • datasets – DataFrame containing experimental data only

  • path – str, Optional: path to save csv. Default None

Returns

DataFrame

trxtools.profiles.stripBigWignames(files=[])

Strip “_rev.bw” and “_fwd.bw” from file names

Parameters

files – list of filenames

Returns

list of unique names
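
A minimal sketch of the suffix stripping described above, written without trxtools. The helper name `strip_strand_suffix` is hypothetical, and keeping the names in first-seen order is an assumption; the library may return them in a different order.

```python
def strip_strand_suffix(files):
    """Remove a trailing "_fwd.bw" or "_rev.bw" and return unique names."""
    names = []
    for f in files:
        for suffix in ("_fwd.bw", "_rev.bw"):
            if f.endswith(suffix):
                f = f[: -len(suffix)]
                break
        if f not in names:  # keep each experiment name only once
            names.append(f)
    return names


bw_files = ["wt_rep1_fwd.bw", "wt_rep1_rev.bw", "mut_rep1_fwd.bw"]
print(strip_strand_suffix(bw_files))
```

Forward and reverse files of the same experiment collapse into a single name.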

trxtools.secondary module

trxtools.secondary.Lstem(vienna='')

Returns list of positions where “(” is found using coordinates {1,inf}

Parameters

vienna – str

Returns

list

trxtools.secondary.Rstem(vienna='')

Returns list of positions where “)” is found using coordinates {1,inf}

Parameters

vienna – str

Returns

list
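
The 1-based coordinates returned by Lstem and Rstem can be illustrated with a short sketch. The helper name `bracket_positions` is hypothetical and is not part of trxtools.

```python
def bracket_positions(vienna, bracket):
    """1-based positions of `bracket` in a dot-bracket string."""
    return [i for i, ch in enumerate(vienna, start=1) if ch == bracket]


structure = ".((....))."
left = bracket_positions(structure, "(")   # analogous to Lstem
right = bracket_positions(structure, ")")  # analogous to Rstem
print(left, right)
```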

trxtools.secondary.checkVienna(sequence='', vienna='')

Validates integrity of vienna file

Parameters
  • sequence – str

  • vienna – str

Returns

True if pass
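
The exact checks performed by checkVienna are not listed here. As a sketch under that caveat, a plausible integrity test is that the sequence and structure have equal length and that the brackets are balanced; the helper `is_valid_vienna` below is hypothetical and only illustrates that idea.

```python
def is_valid_vienna(sequence, vienna):
    """Assumed checks: equal lengths, only '(', ')' and '.', balanced brackets."""
    if len(sequence) != len(vienna):
        return False
    depth = 0
    for ch in vienna:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing bracket without a partner
                return False
        elif ch != ".":
            return False   # unexpected character
    return depth == 0      # every opening bracket was closed


print(is_valid_vienna("GGGAAAACCC", "(((....)))"))
print(is_valid_vienna("GGGAAAACCC", "(((....))."))
```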

trxtools.secondary.loopStems(vienna='', sequence='', loopsList=None, testPrint=False)

Returns positions of stems of single hairpins and multiloop stems. Uses coordinates {1:inf}. Warning: tested with single multiloop stems only

Parameters
  • vienna – str

  • loopsList – list (option)

  • testPrint – boolean, default False

Returns

list, list (stems: list of tuples; multistems: list)

trxtools.secondary.loops(vienna='')

Returns first positions outside the loop, i.e. “.((....)).” returns [(3,8)]

Parameters

vienna – str

Returns

list of tuples
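
The example above, where “.((....)).” gives [(3,8)], can be reproduced with a short regular-expression sketch. This is an illustrative approach rather than the library's own code, and the helper name `hairpin_loops` is hypothetical.

```python
import re


def hairpin_loops(vienna):
    """1-based positions of the brackets that directly enclose each run of dots."""
    # "(" followed by one or more "." and then ")" marks a hairpin loop.
    return [(m.start() + 1, m.end()) for m in re.finditer(r"\(\.+\)", vienna)]


print(hairpin_loops(".((....))."))
print(hairpin_loops("((..))..(...)"))
```

`m.start()` is 0-based, so adding 1 gives the position of the opening bracket; `m.end()` is exclusive and therefore already equals the 1-based position of the closing bracket.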

trxtools.secondary.substructures(vienna='', sequence='')

Lists sub-structures of the given structure

Parameters
  • vienna – str

  • sequence – str

Returns

Series

trxtools.secondary.test(vienna='', sequence='', loops=None, stems=None, multistems=None, linkers=None)

Prints vienna with given features

Parameters
  • vienna – str

  • loops – list of tuples (option)

  • stems – list of tuples (option)

  • multistems – list (option)

  • linkers – list (option)

Returns

None

trxtools.secondary.vienna2format(vienna='', sequence='', loopsList=None, stemsList=None, multistemsList=None, testPrint=False)

Converts vienna format to letters: O - loop, S - stem, M - multiloop stem and L - linker

Parameters
  • vienna – str

  • loopsList – list (optional)

  • stemsList – list (optional)

  • multistemsList – list (optional)

  • testPrint – boolean, default False

Returns

str in “format”