trxtools package
trxtools.SAMgeneral module
- trxtools.SAMgeneral.countDeletion(i=(), expand=0)
Takes tuple (position,CIGARstring) and returns list of mapped deletions. Each deletion is counted once, but expand parameter can merge some longer deletions (common expansion).
- Parameters
i (tuple) – tuple (first position of the read ,CIGAR string)
expand (int, optional) – number of nucleotides to expand for each side, defaults to 0
- Returns
list of mapped positions
- Return type
np.array
>>> countDeletion((400,"3S15M1D9M2S"))
array([415])
>>> countDeletion((400,"3S15M1D9M2S"),expand=3)
array([412, 413, 414, 415, 416, 417, 418])
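The walk along the CIGAR string can be sketched in plain Python as follows. This is a minimal illustration written to reproduce the two example outputs for countDeletion, not the trxtools implementation: the name `deletion_positions` is hypothetical, it returns a list instead of an np.array, and reporting each deletion at its first deleted base is an assumption.

```python
import re

def deletion_positions(read, expand=0):
    """Return reference positions of deletions for a (position, CIGAR) tuple."""
    position, cigar = read
    ref = position  # current reference coordinate while walking the CIGAR
    found = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "D":
            # assumption: a deletion is reported once, at its first base,
            # then widened by `expand` nucleotides on each side
            found.extend(range(ref - expand, ref + expand + 1))
        if op in "MDN=X":  # operations that consume the reference (SAM spec)
            ref += int(length)
    # overlapping expansions are merged by de-duplicating
    return sorted(set(found))

print(deletion_positions((400, "3S15M1D9M2S")))            # [415]
print(deletion_positions((400, "3S15M1D9M2S"), expand=3))  # [412, ..., 418]
```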
- trxtools.SAMgeneral.countMiddle(i=(), expand=0)
Takes tuple (position,CIGARstring) and returns list with the middle of mapped read
- Parameters
i (tuple) – tuple (first position of the read ,CIGAR string)
expand (int, optional) – number of nucleotides to expand for each side, defaults to 0
- Returns
list of mapped positions
- Return type
np.array
>>> countMiddle((400,"3S15M1D9M2S"))
array([412])
>>> countMiddle((400,"3S15M1D9M2S"),expand=3)
array([409, 410, 411, 412, 413, 414])
- trxtools.SAMgeneral.countRead(i=())
Takes tuple (position,CIGARstring) and returns list of mapped positions
- Parameters
i (tuple) – tuple (first position of the read ,CIGAR string)
- Returns
list of mapped positions
- Return type
np.array
>>> countRead((400,"3S15M1D9M2S"))
array([400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424])
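In the countRead example the read covers 25 reference positions (15M + 1D + 9M), while the soft clips (3S, 2S) add nothing. A minimal sketch of that calculation, assuming the reference-consuming operations listed in the SAM specification (M, D, N, =, X); how trxtools itself treats N is not stated here, and `mapped_positions` is a hypothetical name that returns a list rather than an np.array.

```python
import re

# CIGAR operations that consume reference positions (SAM specification)
REF_OPS = set("MDN=X")

def mapped_positions(read):
    """Return the reference positions covered by a (position, CIGAR) tuple."""
    position, cigar = read
    span = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_OPS
    )
    return list(range(position, position + span))

positions = mapped_positions((400, "3S15M1D9M2S"))
print(positions[0], positions[-1], len(positions))  # 400 424 25
```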
- trxtools.SAMgeneral.groupCIGAR(cigar_string='')
Split CIGAR string to list of tuples
- Parameters
cigar_string (str) –
- Returns
list of tuples
[( ),()]
- Return type
list
>>> groupCIGAR("3S44M1S1H")
[('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]
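Splitting a CIGAR string into (length, operation) pairs can be done with a single regular expression. The sketch below reproduces the groupCIGAR example output; `group_cigar` is a hypothetical name and the pattern is an assumption, not necessarily the one used in trxtools.

```python
import re

def group_cigar(cigar_string):
    """Split a CIGAR string into (length, operation) tuples of strings."""
    # one or more digits followed by a single operation character
    return re.findall(r"(\d+)([A-Z=])", cigar_string)

print(group_cigar("3S44M1S1H"))
# [('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]
```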
- trxtools.SAMgeneral.noncoded2profile(df_input=Empty DataFrame Columns: [] Index: [], df_details=Empty DataFrame Columns: [] Index: [])
Turns non-coded ends into profile
- Parameters
df_input (DataFrame) – output of parseNoncoded function
df_details (DataFrame) – chromosome lengths
- Returns
DataFrame with profiles
- Return type
DataFrame
- trxtools.SAMgeneral.noncoded2profile1(df=Empty DataFrame Columns: [] Index: [], length=0)
Turns non-coded ends into profile
- Parameters
df (DataFrame) – output of parseNoncodedList function
length (int) – chromosome length
- Returns
Series with profiles
- Return type
Series
- trxtools.SAMgeneral.parseNoncoded(d={}, minLen=3)
Parse dict with non-coded ends and returns structured DataFrame
- Parameters
d (dict) – dictionary with list of tuples for each chromosome, e.g. {"chrI" : [], "chrII" : []}, defaults to dict()
minLen (int, optional) – minimal length for non-coded end to keep, defaults to 3
- Returns
DataFrame with parsed non-coded ends
- Return type
DataFrame
>>> parseNoncoded({"chrI":[(40, 'AAA'), (35, 'AACAA')]})
   index  AAA  AACAA   chr
0     40  1.0    NaN  chrI
1     35  NaN    1.0  chrI
- trxtools.SAMgeneral.parseNoncodedList(l=[], minLen=3)
Parse list with non-coded ends and returns structured DataFrame
- Parameters
l (list) – list of tuples [(int,str)], defaults to list()
minLen (int, optional) – minimal length for non-coded end to keep, defaults to 3
- Returns
DataFrame with parsed non-coded ends
- Return type
DataFrame
>>> parseNoncodedList([(40, 'AAA'), (35, 'AACAA')])
    AAA  AACAA
35  NaN    1.0
40  1.0    NaN
- trxtools.SAMgeneral.saveBigWig(paths={}, suffix='', bw_name='', chroms=[])
Save gzip pickle data to BigWig
- Parameters
paths (dict, optional) – defaults to dict()
suffix (str, optional) – defaults to str()
bw_name (str, optional) – defaults to str()
chroms (list, optional) – defaults to list()
- Returns
_description_
- Return type
_type_
- trxtools.SAMgeneral.selectEnds(df=Empty DataFrame Columns: [] Index: [], ends='polyA')
Wrapper for functions selecting non-coded ends
- Parameters
df (DataFrame) – output of parseNoncoded or parseNoncodedList
ends (str) – type of ends, currently only “polyA” is available, defaults to “polyA”
- Returns
runs selectPolyA
- Return type
DataFrame
- trxtools.SAMgeneral.selectPolyA(df=Empty DataFrame Columns: [] Index: [])
Select only polyA non-coded ends containing "AAA" and with “A” content above 75%
- Parameters
df (DataFrame) – output of parseNoncoded or parseNoncodedList
- Returns
modified DataFrame
- Return type
DataFrame
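The selection rule described for selectPolyA (the end must contain "AAA" and have an “A” content above 75%) can be illustrated on plain strings. This is a sketch under assumptions: `is_polya` and `min_fraction` are hypothetical names, and the real function works on the DataFrame returned by parseNoncoded or parseNoncodedList rather than on single sequences.

```python
def is_polya(seq, min_fraction=0.75):
    """True if seq contains 'AAA' and its A content is strictly above min_fraction."""
    if not seq:
        return False
    return "AAA" in seq and seq.count("A") / len(seq) > min_fraction

ends = ["AAA", "AACAA", "AAAC", "AAAAC"]
print([s for s in ends if is_polya(s)])  # ['AAA', 'AAAAC']
```

"AACAA" is rejected because it lacks an "AAA" run, and "AAAC" because its A content is exactly 75%, not above it.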
- trxtools.SAMgeneral.selectSortPaths(paths={}, chroms=[], suffix='')
pyBigWig requires sorted chromosomes as input
- Parameters
paths (dict, optional) – _description_, defaults to {}
chroms (list, optional) – _description_, defaults to []
suffix (str, optional) – _description_, defaults to “”
- Returns
_description_
- Return type
_type_
- trxtools.SAMgeneral.stripCIGAR(match=[], to_strip='H')
Removes H from output of groupCIGAR
- Parameters
match (list) – output of groupCIGAR, defaults to []
to_strip (str, optional) – CIGAR mark to be stripped, defaults to “H”
- Returns
modified list of tuples
- Return type
list
- trxtools.SAMgeneral.stripSubstitutions(match)
Strip substitutions on both ends
- Parameters
match (list) – output of groupCIGAR, defaults to []
- Returns
list of tuples
- Return type
list
- trxtools.SAMgeneral.tostripCIGARfive(match=[])
Calculates length of soft-clipped nucleotides at the 5’ end of the read
- Parameters
match (list) – output of groupCIGAR, defaults to []
- Returns
number of substituted nucleotides
- Return type
int
>>> tostripCIGARfive([('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')])
3
- trxtools.SAMgeneral.tostripCIGARthree(match=[])
Nucleotides without alignment at the 3’ end of the read
- Parameters
match (list) – output of groupCIGAR, defaults to []
- Returns
number of substituted nucleotides
- Return type
int
>>> tostripCIGARthree([('3', 'S'), ('44', 'M'), ('1', 'S')])
1
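Both tostripCIGARfive and tostripCIGARthree read the soft-clip length from one end of the grouped CIGAR list. A minimal sketch that reproduces the two documented examples is shown here; `soft_clip_five` and `soft_clip_three` are hypothetical names, and the sketch assumes that any trailing hard clip (H) has already been removed, for instance with stripCIGAR, before the 3’ end is inspected.

```python
def soft_clip_five(match):
    """Length of the soft clip at the start of a grouped CIGAR, 0 if absent."""
    if match and match[0][1] == "S":
        return int(match[0][0])
    return 0

def soft_clip_three(match):
    """Length of the soft clip at the end of a grouped CIGAR, 0 if absent."""
    # assumption: hard clips were stripped beforehand
    if match and match[-1][1] == "S":
        return int(match[-1][0])
    return 0

print(soft_clip_five([('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]))  # 3
print(soft_clip_three([('3', 'S'), ('44', 'M'), ('1', 'S')]))             # 1
```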
trxtools.SAMgenome module
- trxtools.SAMgenome.chromosome2profile3end(l=[], length=0, strand='FWD')
Generates profile for the 3’ ends of reads and saves position of non-coded end
- Parameters
l (list) – list of triple tuples (position, cigar_string, sequence), defaults to []
length (int) – length of chromosome
strand (str, optional) – 'FWD' or 'REV', defaults to ‘FWD’
- Returns
profile, noncoded
- Return type
np.array, list of tuples
>>> chromosome2profile3end(l=[(10,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="FWD")
(0     0.0
1     0.0
...
34    0.0
35    1.0
36    0.0
...
50    0.0
dtype: float64, [(35, 'CC')])
>>> chromosome2profile3end(l=[(40,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="REV")
(0     0.0
1     0.0
...
39    0.0
40    1.0
41    0.0
...
50    0.0
dtype: float64, [(40, 'AAA')])
- trxtools.SAMgenome.parseHeader(filename, name, dirPath)
- trxtools.SAMgenome.reads2genome(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='read', logName='')
Function used by sam2genome. Works for both strands.
- Parameters
name (str) – name of experiment
dirPath (str) –
df_details (DataFrame) – lengths of chromosomes
- Returns
output_df_fwd, output_df_rev, log
- Return type
DataFrame, DataFrame, list
- trxtools.SAMgenome.reads2genome3end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='3end', noncoded=True, ends='polyA', logName='', minLen=3)
Function used by sam2genome3end. Works for both strands.
- Parameters
name (str) – name of experiment
dirPath (str) –
df_details (DataFrame) – lengths of chromosomes
noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True
- Returns
output_df_fwd, output_df_rev, log, noncoded_fwd, noncoded_rev
- Return type
DataFrame, DataFrame, list, DataFrame, DataFrame
- trxtools.SAMgenome.reads2genome5end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='5end', logName='')
Function used by sam2genome5end. Works for both strands.
- Parameters
name (str) – name of experiment
dirPath (str) –
df_details (DataFrame) – lengths of chromosomes
- Returns
output_df_fwd, output_df_rev, log
- Return type
DataFrame, DataFrame, list
- trxtools.SAMgenome.sam2genome(filename='', path='', toClear='', chunks=0, use='3end', noncoded=True, ends='polyA')
Function handling SAM files and generating profiles. Executed using wrapping script SAM2profilesGenomic.py.
- Parameters
filename (str) –
path (str) –
toClear (str, optional) – element of filename to be removed, defaults to ‘’
chunks (int, optional) – Read SAM file in chunks, defaults to 0
noncoded_pA (bool, optional) – Save non-coded polyA ends, defaults to True
noncoded_raw (bool, optional) – Save all non-coded ends, defaults to False
trxtools.SAMgenome_old module
- trxtools.SAMgenome_old.chromosome2profile3end(l=[], length=0, strand='FWD')
Generates profile for the 3’ ends of reads and saves position of non-coded end
- Parameters
l (list) – list of triple tuples (position, cigar_string, sequence), defaults to []
length (int) – length of chromosome
strand (str, optional) – 'FWD' or 'REV', defaults to ‘FWD’
- Returns
profile, noncoded
- Return type
np.array, list of tuples
>>> chromosome2profile3end(l=[(10,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="FWD")
(0     0.0
1     0.0
...
34    0.0
35    1.0
36    0.0
...
50    0.0
dtype: float64, [(35, 'CC')])
>>> chromosome2profile3end(l=[(40,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="REV")
(0     0.0
1     0.0
...
39    0.0
40    1.0
41    0.0
...
50    0.0
dtype: float64, [(40, 'AAA')])
- trxtools.SAMgenome_old.parseHeader(filename, name, dirPath)
- trxtools.SAMgenome_old.reads2genome(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])
Function used by sam2genome. Works for both strands.
- Parameters
name (str) – name of experiment
dirPath (str) –
df_details (DataFrame) – lengths of chromosomes
- Returns
output_df_fwd, output_df_rev, log
- Return type
DataFrame, DataFrame, list
- trxtools.SAMgenome_old.reads2genome3end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], noncoded=True)
Function used by sam2genome3end. Works for both strands.
- Parameters
name (str) – name of experiment
dirPath (str) –
df_details (DataFrame) – lengths of chromosomes
noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True
- Returns
output_df_fwd, output_df_rev, log, noncoded_fwd, noncoded_rev
- Return type
DataFrame, DataFrame, list, DataFrame, DataFrame
- trxtools.SAMgenome_old.reads2genome5end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])
Function used by sam2genome5end. Works for both strands.
- Parameters
name (str) – name of experiment
dirPath (str) –
df_details (DataFrame) – lengths of chromosomes
- Returns
output_df_fwd, output_df_rev, log
- Return type
DataFrame, DataFrame, list
- trxtools.SAMgenome_old.sam2genome(filename='', path='', toClear='', pickle=False, chunks=0)
Function handling SAM files and generating profiles. Executed using wrapping script SAM2profilesGenomic.py.
- Parameters
filename (str) –
path (str) –
toClear (str, optional) – element of filename to be removed, defaults to ‘’
pickle (bool, optional) – save output in pickle format, defaults to False
chunks (int, optional) – Read SAM file in chunks, defaults to 0
- trxtools.SAMgenome_old.sam2genome3end(filename='', path='', toClear='', pickle=False, chunks=0, noncoded_pA=True, noncoded_raw=False)
Function handling SAM files and generating profiles for the 3’ end of reads. Executed using wrapping script SAM2profilesGenomic.py.
- Parameters
filename (str) –
path (str) –
toClear (str, optional) – element of filename to be removed, defaults to ‘’
pickle (bool, optional) – save output in pickle format, defaults to False
chunks (int, optional) – Read SAM file in chunks, defaults to 0
noncoded_pA (bool, optional) – Save non-coded polyA ends, defaults to True
noncoded_raw (bool, optional) – Save all non-coded ends, defaults to False
- trxtools.SAMgenome_old.sam2genome5end(filename='', path='', toClear='', pickle=False, chunks=0)
Function handling SAM files and generating profiles for the 5’ end of reads. Executed using wrapping script SAM2profilesGenomic.py.
- Parameters
filename (str) –
path (str) –
toClear (str, optional) – element of filename to be removed, defaults to ‘’
pickle (bool, optional) – save output in pickle format, defaults to False
chunks (int, optional) – Read SAM file in chunks, defaults to 0
noncoded_pA (bool, optional) – Save non-coded polyA ends, defaults to True
noncoded_raw (bool, optional) – Save all non-coded ends, defaults to False
trxtools.SAMtranscripts module
- trxtools.SAMtranscripts.reads2profile(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])
Takes a list of aligned reads and transforms it into a profile for each transcript. Tested only for the PLUS strand.
- trxtools.SAMtranscripts.reads2profileDeletions(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], expand=5)
Tested only for PLUS strand.
- trxtools.SAMtranscripts.sam2profiles(filename='', path='', geneList=[], toClear='', df_details=Empty DataFrame Columns: [] Index: [], deletions=False, expand=5, pickle=False, chunks=0)
Function handling SAM files and generating profiles. Executed using wrapping script SAM2profiles.py.
- Parameters
filename (str) –
path (str) –
geneList (list) – list of transcripts to be selected
toClear (str, optional) – element of filename to be removed, defaults to ‘’
df_details (DataFrame) – Details of transcripts
deletions (bool, optional) – Generate profile of deletions, defaults to False
expand (int, optional) – Expand deletions, defaults to 5
pickle (bool, optional) – save output in pickle format, defaults to False
chunks (int, optional) – Read SAM file in chunks, defaults to 0
- trxtools.SAMtranscripts.transcript2profile(l=[], length=0)
Takes list of tuples (position, CIGARstring) and generates profile. Works with entire reads, for both strands.
- trxtools.SAMtranscripts.transcript2profileDeletions(l=[], expand=0, length=0)
Takes list of tuples (position, CIGARstring) and generates profile. Not tested for MINUS strand.
trxtools.assays module
- trxtools.assays.bulkInputIDT(data=Empty DataFrame Columns: [] Index: [])
Transforms a DataFrame with columns “template” and “non-template” into a table to be used as bulk input. Discards oligos longer than 200 nt.
- Parameters
data – DataFrame()
- Returns
DataFrame with sequences to order
- trxtools.assays.extruded(seq='', buried=13)
- Parameters
seq – str()
buried – int() length of sequence buried within RNAP
- Returns
Extruded sequence, str
- trxtools.assays.findPrimer(seq='', query='')
Find query sequence within given sequence
- Parameters
seq – str() containing sequence
query – str() within searched sequences
- Returns
(start,stop)
- trxtools.assays.nonTemplateDNA(seq='', primer='')
- Parameters
seq – str() sequence
primer – str() sequence
- Returns
DNA sequence of non-template strand, str
- trxtools.assays.sequenceConstrain(structure='', stall='', RNAprimer='')
Returns sequence constrained with the 5prime RNA primer
- Parameters
structure – str()
stall – str() single letter
RNAprimer – str()
- Returns
str() with sequence constraints for folding algorithm
- trxtools.assays.stalled(seq='', stall='AAA', primer='')
Finds and returns stalled sequence
- Parameters
seq – str()
stall – str() default “AAA”
primer – str()
- Returns
stalled sequence, str
- trxtools.assays.structureFile(structure='', stall='', RNAprimer='')
Saves structure file in current directory
- Parameters
structure – str()
stall – str()
RNAprimer – str()
- Returns
True
- trxtools.assays.templateDNA(seq='', overhang5end='')
- Parameters
seq – str() sequence of RNA
overhang5end – str() sequence of the 5’ end DNA overhang
- Returns
DNA sequence of template strand, str
- trxtools.assays.testScaffold(data=Empty DataFrame Columns: [] Index: [], overhang5end='', RNA_primer='')
Prints scaffold to test before ordering
- Parameters
data – DataFrame()
overhang5end – str() overhang to shift sequences
RNA_primer – str() sequence
- Returns
- trxtools.assays.toOrder(data=Empty DataFrame Columns: [] Index: [], buried='GUCUGUUUUGUGG', stallFull='AAA', afterStall='TGATCGGTAC', overhang5end='TGA', RNA_primer='AGGCCGAAA', bulkInput=True, test=True, lengthMax=200)
Transfers folding function to DNA sequences
- Parameters
data – DataFrame()
buried – str() sequence
stallFull – str() sequence
afterStall – str() sequence
overhang5end – str() sequence
RNA_primer – str() sequence
bulkInput – boolean() default True
test – boolean() default True
lengthMax – int() default 200
- Returns
DataFrame of sequences to order
trxtools.go_enrichment module
- trxtools.go_enrichment.get_enrichment(query_genes, organism, use_reference_set=False, ref_genes=None, ref_organism=None, go_dataset='biological_process', test_type='FISHER', correction='FDR')
Run a GO term enrichment test using PANTHER API
- Parameters
query_genes (list) – List of sequence identifiers of queried genes (e.g. transcript ids, gene ids)
organism (str) – Taxid of query species (e.g. “9606” for H. sapiens)
use_reference_set (bool) – Use a custom set of reference (background) genes? Default False. If True, ref_genes and ref_organism need to be specified.
ref_genes (list) – optional list of reference genes. Specifying None (default) will use the whole genome of species specified in organism. When passing a list, ref_organism taxid must also be provided.
ref_organism (str) – Taxid of reference species, required when ref_genes is not None
go_dataset (str) – Which annotation dataset to query, “biological_process” or “molecular_function”
test_type (str) – Which statistical test to use. Available: “FISHER” (default), “BINOMIAL”
correction (str) – Which multiple testing correction method to use. Available: “FDR” (default), “BONFERRONI”, “NONE”
- Returns
Unfiltered DataFrame of results.
- Return type
pandas.DataFrame
trxtools.methods module
- trxtools.methods.addCluster(df=Empty DataFrame Columns: [] Index: [], n=10)
Assigns n clusters to the data using KMeans algorithm
- Parameters
df – DataFrame
n – int, number of clusters
- Returns
- trxtools.methods.bashCommand(bashCommand='')
Run command in bash using subprocess.call()
- trxtools.methods.calGC(dataset=Empty DataFrame Columns: [] Index: [], calFor=['G', 'C'])
Returns GC content in a given string - uses [‘nucleotide’] column
- Parameters
dataset – DataFrame() with “nucleotide” column
- Returns
fraction of GC content, float
- trxtools.methods.cleanNames(df=Empty DataFrame Columns: [] Index: [], additional_tags=[])
Cleans some problems with names if exist
- Parameters
df – DataFrame() where names of columns are name of experiments
additional_tags – list()
- Returns
DataFrame() with new names
- trxtools.methods.define_experiments(paths_in, whole_name=False, strip='_hittable_reads.txt')
Parse file names and extract experiment name from them
- Parameters
paths_in – str()
whole_name – boolean() default False. By default the script takes the first ‘a_b_c’
strip – str() to strip from filename.
- Returns
list() of experiment names, list() of paths.
- trxtools.methods.expNameParser(name, additional_tags=[], order='b_d_e_p')
Function handles experiment name; recognizes AB123456 as experiment date; BY4741 or HTP or given string as bait protein
- Parameters
name –
additional_tags – list of tags
output – default ‘root’ ; print other elements when ‘all’
order – default ‘b_d_e_p’ b-bait; d-details, e-experiment, p-prefix
- Returns
list of reordered name
- trxtools.methods.expStats(input_df=Empty DataFrame Columns: [] Index: [], smooth=True, window=10, win_type='blackman')
Returns DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments
- Parameters
input_df – DataFrame
smooth – boolean, if True apply smoothing window, default=True
window – int, smoothing window, default 10
win_type – str type of smoothing window, default “blackman”
- Returns
DataFrame
- trxtools.methods.filterExp(datasets, let_in=[''], let_out=['wont_find_this_string'], verbose=False)
Returns object with filtered columns/keys.
- Parameters
datasets – DataFrame() or dict() with exp name as a key
let_in – list() with elements of name to filter in
let_out – list() with elements of name to filter out
- Returns
DataFrame() or dict()
- trxtools.methods.groupCRACsamples(df=<class 'pandas.core.frame.DataFrame'>, use='protein', toDrop=[])
Parses CRAC names and annotates them using one of the following features [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’, ‘sampleRep’]
- Parameters
df – DataFrame
use – str, choose from [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’, ‘sampleRep’], default = ‘protein’
toDrop – list of words in the CRAC name that qualify the sample for rejection, default = []
- Returns
DataFrame with added column [‘group’]
- trxtools.methods.indexOrder(df=Empty DataFrame Columns: [] Index: [], additional_tags=[], output='root', order='b_d_e_p')
Apply expNameParser to whole DataFrame
- Parameters
df – DataFrame() where names of columns are name of experiments
additional_tags – list()
output –
order – str() default ‘b_d_e_p’ b-bait; d-details, e-experiment, p-prefix
- Returns
DataFrame() with new names
- trxtools.methods.letterContent(s='', letter='A')
- trxtools.methods.list_paths_in_current_dir(suffix='', stdin=False)
- Parameters
suffix – str() lists paths in current directory ending with an indicated suffix only
stdin – boolean() if True read from standard input instead of the current directory
- Returns
list() of paths
- trxtools.methods.normalize(df=<class 'pandas.core.frame.DataFrame'>, log2=False, pseudocounts=0.1)
- Parameters
df – DataFrame
log2 – boolean, default=False
pseudocounts – float, default=0.1
- Returns
- trxtools.methods.parseCRACname(s1=<class 'pandas.core.series.Series'>)
Parse CRAC name into [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’] using this order. “_” is used to split the name.
- Parameters
s1 – Series,
- Returns
DataFrame
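Splitting a name on “_” and assigning the parts to the six fields in order can be sketched for a single string as follows. The sample name is made up, `parse_crac_name` is a hypothetical helper, and padding short names with None while dropping surplus parts is an assumption; the documented function takes a Series and returns a DataFrame.

```python
FIELDS = ["expID", "expDate", "protein", "condition1", "condition2", "condition3"]

def parse_crac_name(name):
    """Map the '_'-separated parts of a name onto FIELDS, in order."""
    parts = name.split("_")[:len(FIELDS)]
    # pad with None so that short names still yield every field
    parts += [None] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

print(parse_crac_name("C01_200101_Rpa190_glucose"))
```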
- trxtools.methods.quantileCategory(s1=Series([], dtype: float64), q=4)
Quantile-based discretization function based on pandas.qcut function.
- Parameters
s1 – Series()
q – int() number of quantiles: 10 for deciles, 5 for quintiles, 4 for quartiles, etc., default q=4
- Returns
Series
- trxtools.methods.randomDNAall(length=0, letters='CGTA')
Generates all possible random sequences of a given length
- Parameters
length – int()
letters – str() with letters that will be used
- Returns
list() of str()
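Enumerating every sequence of a given length over an alphabet is a Cartesian product. The sketch below uses itertools.product; `all_sequences` is a hypothetical name and the output order is that of the sketch, not necessarily that of randomDNAall.

```python
from itertools import product

def all_sequences(length, letters="CGTA"):
    """Return every string of the given length built from `letters`."""
    return ["".join(combo) for combo in product(letters, repeat=length)]

seqs = all_sequences(2)
print(len(seqs))   # 16
print(seqs[:4])    # ['CC', 'CG', 'CT', 'CA']
```

The list grows as len(letters) ** length, so a length of 10 already yields over a million sequences.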
- trxtools.methods.randomDNAsingle(length=0, letters='CGTA')
Random generator of nucleotide sequence
- Parameters
length – int()
letters – str() with letters that will be used
- Returns
str()
- trxtools.methods.readSalmon(nameElem='', path='', toLoad='', toClear=[], toAdd='', column='NumReads', df=None, overwrite=False)
- Parameters
nameElem – str, elem to load
path – str
toLoad – str, additional param for filtering, by default equal to nameElem
toClear – str
toAdd – str
df – pd.DataFrame
overwrite – boolean, default=False
- Returns
- trxtools.methods.read_HTSeq_output(path='', toLoad='classes', toClear=[], toAdd='', df=None, overwrite=False)
Reads multiple HTSeq tab files to one DataFrame
- Parameters
path – str, path to directory with files
toClear – str, will be removed from file name
toAdd – str, to be added to file name
df – DataFrame, to be appended; default=None
overwrite – boolean, allows for overwriting during appending, default = False
- Returns
DataFrame
- trxtools.methods.read_STARstats(path='', toClear=[], toAdd='', df=None, overwrite=False)
Reads multiple STAR statistics files to one DataFrame
- Parameters
path – str, path to directory with files
toClear – str, will be removed from file name
toAdd – str, to be added to file name
df – DataFrame, to be appended; default=None
overwrite – boolean, allows for overwriting during appending, default = False
- Returns
DataFrame
- trxtools.methods.read_featureCount(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)
Read tab files with common first column
- Parameters
nameElem – str, present in all files
path – str, path to directory with files
toLoad – str, to be present in file name (optional)
toClear – str, will be removed from file name
toAdd – str, to be added to file name
df – DataFrame, to be appended; default=None
overwrite – boolean, allows for overwriting during appending, default = False
- Returns
DataFrame
- trxtools.methods.read_list(filepath='')
Read list from file. Each row becomes item in the list.
- Parameters
filepath – str
- Returns
list
- trxtools.methods.read_tabFile(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)
Read tab files with common first column
- Parameters
nameElem – str, present in all files
path – str, path to directory with files
toLoad – str, to be present in file name
toClear – str, will be removed from file name
toAdd – str, to be added to file name
df – DataFrame, to be appended; default=None
overwrite – boolean, allows for overwriting during appending, default = False
- Returns
DataFrame
- trxtools.methods.reverse_complement(seq)
Reverse complement
- Parameters
seq – str
- Returns
str
- trxtools.methods.reverse_complement_DNA(seq)
Reverse complement
- Parameters
seq – str
- Returns
str
- trxtools.methods.reverse_complement_RNA(seq)
Reverse complement
- Parameters
seq – str
- Returns
str
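A reverse complement is the complement of each base read in the opposite direction. A minimal sketch using str.translate is given here; the names `revcomp_dna` and `revcomp_rna` are hypothetical, and leaving characters outside ACGT/ACGU unchanged is a choice of this sketch, not documented trxtools behaviour.

```python
# translation tables for complementary bases (upper and lower case)
DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
RNA_COMPLEMENT = str.maketrans("ACGUacgu", "UGCAugca")

def revcomp_dna(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(DNA_COMPLEMENT)[::-1]

def revcomp_rna(seq):
    """Reverse complement of an RNA sequence."""
    return seq.translate(RNA_COMPLEMENT)[::-1]

print(revcomp_dna("TTTGC"))  # GCAAA
print(revcomp_rna("AUGC"))   # GCAU
```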
- trxtools.methods.rollingGC(s=<class 'pandas.core.series.Series'>, window=10)
Calculates GC from sequence, uses ‘boxcar’ window
- Parameters
s – Series containing sequence
window – window size for GC calculation
- Returns
Series with GC calculated, center=False
- trxtools.methods.runPCA(data=Empty DataFrame Columns: [] Index: [], n_components=2)
Run PCA analysis and re-assigns column names and index names
- Parameters
data – DataFrame
n_components – int, default 2
- Returns
tuple consisting of DataFrame with PCA results and a list of PC values
- Return type
tuple
- trxtools.methods.timestamp()
- Returns
timestamp as a str()
- trxtools.methods.timestampRandomInt()
- Returns
timestamp and random number as a str()
trxtools.nascent module
- class trxtools.nascent.Fold(tempDir=None)
Bases:
object
- RNAfold(data, saveData=False, temp=None)
Calculates dG using RNAfold (ViennaRNA)
- Parameters
data – input data {list, Series, DataFrame with “name” column}
saveData – boolean, default False
temp – int, default None
- Returns
DataFrame
- RNAinvert(structure='', saveData=False, temp=None, n=5, RNAprimer='', stall='', quick=False)
Returns n sequences with the given structure
- Parameters
structure – str with secondary RNA structure
saveData – boolean, default False
temp – int, default None
n – int number of output sequences, default 5
RNAprimer – str sequence
stall – str nucleotide
quick – boolean if False uses -Fmp -f 0.01 params, default False
- Returns
DataFrame
- RNAinvertStall(structure='', RNAprimer='', stall='A', n=200)
Returns sequences without nt that is present in stall
- Parameters
structure – str with secondary RNA structure
RNAprimer – str sequence
stall – str sequence, default “A”
n – int, default 200
- Returns
DataFrame
- UNAfold(data, saveData=False, temp=None)
Calculates dG using UNAfold
- Parameters
data – input data {list, Series, DataFrame with “name” column}
saveData – boolean, default False
temp – int, default None
- Returns
DataFrame
- bashFolding(method='RNA')
Runs RNA folding using bash
- Parameters
method – “RNA” for ViennaRNA or “UNA” for UNAfold
- Returns
- class trxtools.nascent.Hybrid(tempDir=None)
Bases:
object
- RNAhybrid(data, saveData=False, temp=None)
Calculates dG using hybrid-min
- Parameters
data – input data {list, Series, DataFrame with “name” column}
saveData – boolean, default False
temp – int, default None
- Returns
DataFrame
- bashHybrid()
Runs hybrid-min using bash
- Returns
- trxtools.nascent.analyseViennamarkGC(vienna='', sequence='')
Leaves only C and G in stem structures
- Parameters
vienna – str with vienna format
sequence – str sequence
- Returns
str
- trxtools.nascent.extendingWindow(sequence='', name='name', strand='plus', temp=30, m=7)
Returns DataFrame of sequences of all possible lengths between minimum (m) and length of input sequence -1
- Parameters
sequence – str
name – str, default “name”
strand – str {“plus”,”minus”,”both”}, default “plus” (not tested for others)
temp – int, default 30
m – int, default 7 - RNAfold does not return any values for length shorter than 7 even at 4 deg C
- Returns
DataFrame of sequences with names according to nascent.slidingWindow convention
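The idea of producing every length from m up to len(sequence) - 1 can be sketched on a plain string. This sketch assumes the sequences grow from the 5’ end (prefixes), as a nascent transcript would; `extending_prefixes` is a hypothetical name, and the naming of the sequences and the DataFrame output of extendingWindow are not reproduced.

```python
def extending_prefixes(sequence, m=7):
    """Return prefixes of `sequence` with lengths m .. len(sequence) - 1."""
    return [sequence[:n] for n in range(m, len(sequence))]

print(extending_prefixes("ACGUACGUAC", m=7))
# ['ACGUACG', 'ACGUACGU', 'ACGUACGUA']
```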
- trxtools.nascent.foldNascentElem(data=Empty DataFrame Columns: [] Index: [])
Fold the very 3’ of nascent elements.
- Parameters
data – DataFrame
- Returns
DataFrame
- trxtools.nascent.handleInput(data, keepNames=True)
Input data with columns: “seq” or “sequence” and “name” (optional)
- Parameters
data – {str, list, Series, DataFrame}
keepNames – if True use given names, default True
- Returns
DataFrame where index becomes the name of the sequence to fold
- trxtools.nascent.join2d(df=Empty DataFrame Columns: [] Index: [], use='format')
- Parameters
df – DataFrame
use – str, default “format”
- Returns
DataFrame
- trxtools.nascent.markVienna(df=Empty DataFrame Columns: [] Index: [])
Apply analyseViennamarkGC
- Parameters
df – DataFrame
- Returns
DataFrame
- trxtools.nascent.merge2d(df=<class 'pandas.core.frame.DataFrame'>)
- Parameters
df – DataFrame
- Returns
DataFrame
- trxtools.nascent.name2index(s1=Series([], dtype: object))
Extracts position from sequence name
- Parameters
s1 – Series with names from prepareNascent function
- Returns
Series with positions
- trxtools.nascent.nascentElems(vienna='', sequence='')
Describe elements of secondary structure: stems, multistems.
- Parameters
vienna – str
sequence – str
- Returns
str
- trxtools.nascent.nascentElemsDataFrame(data=Empty DataFrame Columns: [] Index: [])
Apply nascentElems to DataFrame of folded sequences
- Parameters
data – DataFrame containing df[‘vienna’] and df[‘sequence’]
- Returns
DataFrame
- trxtools.nascent.nascentFolding(sequence='', temp=30, window=100)
Combines folding function: fold RNA, locate last nascent element and calculate dG of it.
- Parameters
sequence – str
temp – int, default 30
window – int, default 100
- Returns
DataFrame
- trxtools.nascent.parseFoldingName(df=Empty DataFrame Columns: [] Index: [])
- trxtools.nascent.prepareNascent(sequence='', name='name', strand='plus', temp=30, window=100)
Divide long transcript into short sequences. Combines output of extendingWindow and slidingWindow.
- Parameters
sequence – str
name – str, default “name”
strand – str {plus,minus,both}, default “plus” (not tested for others)
temp – int, default 30
window – int, default 100
- Returns
Dataframe with sequences
- trxtools.nascent.selectFoldedN(data=Empty DataFrame Columns: [] Index: [], n=5, pattern='(((((....)))))')
Takes Fold().RNAfold() df as an input. Selects n rows with a given pattern on the 5’ end and the most different folding energy
- Parameters
data – DataFrame
n – int samples, default 5
pattern – str vienna format, default “(((((....)))))”
- Returns
DataFrame
- trxtools.nascent.slidingWindow(sequence='', name='name', strand='plus', temp=30, window=100)
Slices sequence using sliding window
- Parameters
sequence – str
name – str default “name”
strand – str {“plus”,”both”,”minus”} default “plus”
temp – int default 30
window – int default 100
- Returns
DataFrame with sliding windows
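Slicing a sequence with a sliding window can be illustrated on a plain string. The sketch assumes a step of one nucleotide and returns a list; `sliding_windows` is a hypothetical name, and the naming convention and DataFrame output of slidingWindow are not reproduced.

```python
def sliding_windows(sequence, window=100):
    """Return all substrings of length `window`, moving one position at a time."""
    return [sequence[i:i + window] for i in range(len(sequence) - window + 1)]

print(sliding_windows("ACGUAC", window=4))  # ['ACGU', 'CGUA', 'GUAC']
```

A sequence shorter than the window yields an empty list in this sketch.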
trxtools.plotting module
- trxtools.plotting.clusterClusterMap(df)
Clustermap for clusters
- Parameters
df – DataFrame
- Returns
- trxtools.plotting.hplotSTARstats(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.hplotSTARstats_chimeric(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.hplotSTARstats_mapping(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.hplotSTARstats_mistmatches(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.hplotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.hplotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plotCumulativePeaks(ref, df2=Empty DataFrame Columns: [] Index: [], local_pos=[], dpi=150, title='', start=None, stop=None, window=50, figsize=(4, 3), color1='green', color2='magenta', lc='red')
Plot single gene peaks metaplot.
- Parameters
ref – str with path to csv file or DataFrame
df2 – DataFrame
local_pos – list of features (peaks/troughs)
dpi – int, default 150
title – str
start – int
stop – int
window – int, default 50
figsize – tuple, default (4,3)
color1 – str, default “green”
color2 – str, default “magenta”
lc – str, default “red”
- Returns
- trxtools.plotting.plotPCA(data=Empty DataFrame Columns: [] Index: [], names=[], title='', PClimit=1, figsize=(7, 7), PCval=[])
Plot PCA plot
- Parameters
data – DataFrame
names – list of names to annotate
title – str
PClimit – int number of PC to plot, default 1
figsize – tuple, default (7,7)
- Returns
>>> plotPCA(methods.runPCA(example_df)[0])
- trxtools.plotting.plotSTARstats(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plotSTARstats_chimeric(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plotSTARstats_mapping(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plotSTARstats_mistmatches(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)
- Parameters
df – DataFrame
dpi – int, default=150
- Returns
- trxtools.plotting.plot_as_box_plot(df=Empty DataFrame Columns: [] Index: [], title='', start=None, stop=None, figsize=(7, 3), ylim=(None, 0.01), dpi=150, color='green', h_lines=[], lc='red', offset=0)
Plots figure similar to box plot: median, 2 and 3 quartiles and min-max range
- Parameters
df – Dataframe() containing following columns:
`['position'] ['mean'] ['median'] ['std']`
optionally `['nucleotide'] ['q1'] ['q3'] ['max'] ['min']`
title – str
start – int
stop – int
figsize – tuple, default (7,3)
ylim – tuple, OY axis limits, default (None,0.01)
dpi – int, default 150
color – str, default “green”
h_lines – list of horizontal lines
lc – str color of horizontal lines, default “red”
offset – int number to offset position if 5’ flank was used, default 0
- Returns
- trxtools.plotting.plot_diff(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm', label='', start=None, stop=None, plot_medians=True, plot_ranges=True, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], offset=0)
Plot given dataset and reference, differences are marked
- Parameters
ref – str with path to csv file or DataFrame
dataset – DataFrame containing following columns:
`['position'] ['mean'] ['median'] ['std']`
optionally `['nucleotide'] ['q1'] ['q3'] ['max'] ['min']`
ranges – str “mm” : min-max or “qq” : q1-q3
label – str
start – int
stop – int
plot_medians – boolean if True plot medians, default True
plot_ranges – boolean if True plot ranges, default True
figsize – tuple, default (7,3)
ylim – tuple OY axes lim, default (None,0.01)
h_lines – list of horizontal lines
offset – int number to offset position if 5’ flank was used, default 0
- Returns
- trxtools.plotting.plot_heatmap(df=Empty DataFrame Columns: [] Index: [], title='Heatmap of differences between dataset and reference plot for RDN37-1', vmin=None, vmax=None, figsize=(20, 10))
Plot heat map of differences, from dataframe generated by compare1toRef(dataset, heatmap=True) function
- Parameters
df – DataFrame
title – str
vmin –
vmax –
figsize – tuple, default (20,10)
- Returns
- trxtools.plotting.plot_to_compare(ref, df=Empty DataFrame Columns: [] Index: [], color1='green', color2='black', ref_label='', label='', title='', start=None, stop=None, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], lc='red', dpi=150, offset=300)
Figure to compare two plots similar to box plot: median, 2 and 3 quartiles and min-max range
- Parameters
ref – str with path to csv file or DataFrame
df – DataFrame
color1 – str, default “green”
color2 – str, default “black”
ref_label – str
label – str
title – str
start – int
stop – int
figsize – tuple, default (7,3)
ylim – tuple, OY axis limits, default (None,0.01)
h_lines – list of horizontal lines
lc – str color of horizontal lines, default “red”
dpi – int, default 150
offset – int number to offset position if 5’ flank was used, default 300
- Returns
trxtools.profiles module
- trxtools.profiles.FoldingFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, ranges=0, offset=15, fold='dG65nt@30C')
Pulls folding information from BigWig folding data for a given gene.
- Parameters
gene_name – str
gtf – pyCRAC.GTF2 object with GTF and TAB files loaded
bwFWD – dict of pyBigWig objects
bwREV – dict of pyBigWig objects
ranges – int flanks to be added for the gene, default 0
offset – int to offset folding data, default 15
fold – name of output column, default=”dG65nt@30C”
- Returns
DataFrame
- trxtools.profiles.calculateFDR(data=Series([], dtype: float64), iterations=100, target_FDR=0.05)
Calculates False Discovery Rate (FDR) for a given dataset.
- Parameters
data – Series
iterations – int, default 100
target_FDR – float, default 0.05
- Returns
Series
- trxtools.profiles.compare1toRef(ref, dataset=Series([], dtype: float64), ranges='mm', heatmap=False, relative=False)
Takes a Series and compares it with the reference DataFrame
- Parameters
ref – str with path to csv file or DataFrame
dataset – Series
ranges – mm : min-max or qq : q1-q3
heatmap – boolean; if False, returns a DataFrame with columns rae_min, rae_max, ear_min, ear_max (reference_above_experiment minimum, etc.); if True, returns a Series of differences for plotting a heatmap
relative – boolean, used only with heatmap=True; recalculates differences relative to the peak size. Warning: negative values fall in the range -1 to 0, whereas positive values start at 0 and can exceed 1
- Returns
Dataframe (heatmap=False) or Series (heatmap=True)
- trxtools.profiles.compareMoretoRef(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm')
Takes a DataFrame created by filter_df and compares it with the reference DataFrame
- Parameters
ref – str with path to csv file or DataFrame
dataset – DataFrame
ranges – mm : min-max or qq : q1-q3
- Returns
Dataframe
- trxtools.profiles.dictBigWig(files=[], path='', strands=True)
Preloads BigWig files to memory using pyBigWig tools
- Parameters
files – list of files
path – str
strands – boolean, default True
- Returns
dict or dict, dict
- trxtools.profiles.findPeaks(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)
Find local extrema using SciPy argrelextrema function
- Parameters
s1 – Series data to localize peaks
window – int, To smooth data before peak-calling. default 1 (no smoothing)
win_type – str type of smoothing window, default “blackman”
order – int minimal spacing between peaks, argrelextrema order parameter, default 20
- Returns
list of peaks
- trxtools.profiles.findTroughs(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)
Find local minima using SciPy argrelextrema function
- Parameters
s1 – Series data to localize troughs
window – int, To smooth data before trough-calling. default 1 (no smoothing)
win_type – str type of smoothing window, default “blackman”
order – int minimal spacing between minima, argrelextrema order parameter, default 20
- Returns
list of troughs
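How the `order` parameter controls peak and trough calling can be illustrated without SciPy. The sketch below is a pure-Python stand-in for the idea behind argrelextrema, not the code used by findPeaks or findTroughs: the name `local_extrema` is hypothetical, the smoothing step (`window`, `win_type`) is omitted, and the two endpoints are never reported.

```python
def local_extrema(values, order=20, kind="max"):
    """Return indices that are strict extrema within `order` points on each side.

    Hypothetical helper; neighbours that would fall outside the data are ignored.
    """
    if order < 1:
        raise ValueError("order must be >= 1")
    if kind not in ("max", "min"):
        raise ValueError("kind must be 'max' or 'min'")
    n = len(values)
    found = []
    for i in range(1, n - 1):  # endpoints are skipped
        lo, hi = max(0, i - order), min(n - 1, i + order)
        neighbours = [values[j] for j in range(lo, hi + 1) if j != i]
        if kind == "max" and all(values[i] > v for v in neighbours):
            found.append(i)
        elif kind == "min" and all(values[i] < v for v in neighbours):
            found.append(i)
    return found


signal = [0, 2, 1, 0, 3, 1, 0, 5, 2, 1]
peaks = local_extrema(signal, order=2, kind="max")       # [1, 4, 7]
troughs = local_extrema(signal, order=2, kind="min")     # [3, 6]
wide_peaks = local_extrema(signal, order=3, kind="max")  # [7]
```

Raising `order` from 2 to 3 leaves only the tallest peak, because each candidate must now exceed three neighbours on either side.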
- trxtools.profiles.geneFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, toStrip='', ranges=0)
Pulls genome coverage from BigWig data for a given gene. One BigWig file -> one column.
- Parameters
gene_name – str
gtf – pyCRAC.GTF2 object with GTF and TAB files loaded
bwFWD – dict of pyBigWig objects
bwREV – dict of pyBigWig objects
toStrip – str of name to be stripped
ranges – int flanks to be added for the gene, default 0
- Returns
DataFrame
- trxtools.profiles.ntotal(df=<class 'pandas.core.frame.DataFrame'>, drop=True)
Normalize data in DataFrame to fraction of total column
- Parameters
df – DataFrame
drop – boolean, if True drop ‘position’ and ‘nucleotide’ columns, default True
- Returns
DataFrame
- trxtools.profiles.parseConcatFile(path, gtf, use='reads', RPM=False, ranges=1000)
Parse concat file
- Parameters
path – str with path of the concat file
gtf – pyCRAC.GTF2 object with GTF and TAB files loaded
use – str with name of column to use [‘reads’, ‘substitutions’, ‘deletions’], default “reads”
RPM – boolean, default False
ranges – int flanks to be added for the gene, default 1000
- Returns
dict of DataFrames; using gene name as a key
- trxtools.profiles.preprocess(input_df=Empty DataFrame Columns: [] Index: [], let_in=[''], let_out=['wont_find_this_string'], stats=False, smooth=True, window=10, win_type='blackman')
Combines methods.filterExp and expStats. Returns DataFrame with chosen experiments, optionally applying smoothing and stats
- Parameters
input_df – DataFrame
let_in – list of words that characterize experiment, default [‘’]
let_out – list of words that disqualify experiments, default [‘wont_find_this_string’]
stats – boolean, if True return stats for all experiments, default False
smooth – boolean, if True apply smoothing window, default True
window – int smoothing window, default 10
win_type – str type of smoothing window, default “blackman”
- Returns
DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments
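The let_in / let_out selection can be pictured as a filter on experiment (column) names. The sketch below is an assumption-laden illustration, not the behaviour of methods.filterExp: the name `filter_experiments` is hypothetical, and it assumes a name must contain every let_in word and none of the let_out words.

```python
def filter_experiments(names, let_in=("",), let_out=("wont_find_this_string",)):
    """Keep names containing every let_in word and none of the let_out words.

    Hypothetical helper; the all/any semantics are assumed, not documented.
    """
    kept = []
    for name in names:
        has_all = all(word in name for word in let_in)
        has_banned = any(word in name for word in let_out)
        if has_all and not has_banned:
            kept.append(name)
    return kept


columns = ["WT_rep1", "WT_rep2", "mutant_rep1", "WT_rep3_failed"]
selected = filter_experiments(columns, let_in=["WT"], let_out=["failed"])
everything = filter_experiments(columns)  # defaults keep every name
```

Under these assumptions the defaults are neutral: the empty string is contained in every name, and the placeholder let_out word matches none of them.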
- trxtools.profiles.pseudocounts(df=<class 'pandas.core.frame.DataFrame'>, value=0.01, drop=True)
Add pseudocounts to data
- Parameters
df – DataFrame
value – float, default 0.01
drop – boolean, if True drop ‘position’ and ‘nucleotide’ columns, default True
- Returns
DataFrame
- trxtools.profiles.save_csv(data_ref=Empty DataFrame Columns: [] Index: [], datasets=Empty DataFrame Columns: [] Index: [], path=None)
Saves Dataframe to csv
- Parameters
data_ref – DataFrame with ['position'] and ['nucleotide'] columns
datasets – DataFrame containing experimental data only
path – str, Optional: path to save csv. Default None
- Returns
DataFrame
- trxtools.profiles.stripBigWignames(files=[])
Strip “_rev.bw” and “_fwd.bw” from file names
- Parameters
files – list of filenames
- Returns
list of unique names
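A minimal sketch of the suffix stripping, assuming the suffix is removed only from the end of each file name and that first-seen order is kept (the ordering of the real function's output is not stated). The name `strip_bigwig_names` is hypothetical.

```python
def strip_bigwig_names(files):
    """Remove a trailing _fwd.bw or _rev.bw and return unique names in first-seen order."""
    names = []
    for filename in files:
        for suffix in ("_fwd.bw", "_rev.bw"):
            if filename.endswith(suffix):
                filename = filename[: -len(suffix)]
                break
        if filename not in names:  # keep each experiment name once
            names.append(filename)
    return names


files = ["sampleA_fwd.bw", "sampleA_rev.bw", "sampleB_fwd.bw", "sampleB_rev.bw"]
unique_names = strip_bigwig_names(files)
```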
trxtools.secondary module
- trxtools.secondary.Lstem(vienna='')
Returns list of positions where “(” is found using coordinates {1,inf}
- Parameters
vienna – str
- Returns
list
- trxtools.secondary.Rstem(vienna='')
Returns list of positions where “)” is found using coordinates {1,inf}
- Parameters
vienna – str
- Returns
list
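The 1-based coordinate convention used by Lstem and Rstem can be shown with a short stand-alone sketch. The helper name `bracket_positions` is hypothetical and is not part of trxtools.

```python
def bracket_positions(vienna, bracket):
    """Return 1-based positions of `bracket` in a dot-bracket (vienna) string."""
    return [i for i, char in enumerate(vienna, start=1) if char == bracket]


structure = ".((....))."
left = bracket_positions(structure, "(")   # positions of "("
right = bracket_positions(structure, ")")  # positions of ")"
```

For this structure the opening brackets sit at positions 2 and 3 and the closing brackets at 8 and 9.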
- trxtools.secondary.checkVienna(sequence='', vienna='')
Validates integrity of vienna file
- Parameters
sequence – str
vienna – str
- Returns
True if pass
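The checks performed by checkVienna are not listed here. A plausible minimal validation, offered only as an assumption, is that the sequence and structure have equal length, that only dot-bracket characters occur, and that the brackets are balanced. The name `check_vienna` is hypothetical.

```python
def check_vienna(sequence, vienna):
    """Return True if `vienna` is a balanced dot-bracket string matching `sequence` in length.

    Hypothetical helper; the real function may apply different or additional checks.
    """
    if len(sequence) != len(vienna):
        return False
    depth = 0
    for char in vienna:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # closing bracket without a partner
                return False
        elif char != ".":
            return False  # unexpected character
    return depth == 0


ok = check_vienna("GGGAAAACCC", "(((....)))")
too_short = check_vienna("GGGAAAACCC", "(((....))")
unbalanced = check_vienna("GGGAAAACCC", "((.....)))")
```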
- trxtools.secondary.loopStems(vienna='', sequence='', loopsList=None, testPrint=False)
Returns positions of stems of single hairpins and multiloop stems. Uses coordinates {1,inf}. Warning: tested with single multiloop stems only
- Parameters
vienna – str
loopsList – list (option)
testPrint – boolean, default False
- Returns
list, list (stems: list of tuples; multistems: list)
- trxtools.secondary.loops(vienna='')
Returns first positions outside the loop, i.e. “.((....)).” returns [(3,8)]
- Parameters
vienna – str
- Returns
list of tuples
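One way to obtain these coordinates is a regular expression that matches an opening bracket, a run of dots and a closing bracket. The sketch below illustrates the convention only; `hairpin_loops` is a hypothetical name and the library may locate loops differently.

```python
import re


def hairpin_loops(vienna):
    """Return 1-based (opening, closing) bracket positions flanking each hairpin loop."""
    # m.start() is the 0-based index of "(", so +1 gives its 1-based position;
    # m.end() is one past the 0-based index of ")", which equals its 1-based position.
    return [(m.start() + 1, m.end()) for m in re.finditer(r"\(\.+\)", vienna)]


single = hairpin_loops(".((....)).")
double = hairpin_loops("((...))..((....))")
```

Here `single` is `[(3, 8)]`, and `double` contains one tuple per hairpin: `[(2, 6), (11, 16)]`.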
- trxtools.secondary.substructures(vienna='', sequence='')
Lists sub-structures of the given structure
- Parameters
vienna – str
sequence – str
- Returns
Series
- trxtools.secondary.test(vienna='', sequence='', loops=None, stems=None, multistems=None, linkers=None)
Prints vienna with given features
- Parameters
vienna – str
loops – list of tuples (option)
stems – list of tuples (option)
multistems – list (option)
linkers – list (option)
- Returns
None
- trxtools.secondary.vienna2format(vienna='', sequence='', loopsList=None, stemsList=None, multistemsList=None, testPrint=False)
Converts vienna format to letters: O - loop, S - stem, M - multiloop stem and L - linker
- Parameters
vienna – str
loopsList – list (optional)
stemsList – list (optional)
multistemsList – list (optional)
testPrint – boolean, default False
- Returns
str in “format”