trxtools package

trxtools.sam module

trxtools.sam.SAMgeneral submodule

trxtools.sam.SAMgeneral.countDeletion(i=(), expand=0)

Takes tuple (position,CIGARstring) and returns list of mapped deletions. Each deletion is counted once, but expand parameter can merge some longer deletions (common expansion).

Parameters:

i (tuple) – tuple (first position of the read ,CIGAR string)
expand (int, optional) – number of nucleotides to expand for each side, defaults to 0

Returns:

list of mapped positions

Return type:

np.array

>>> countDeletion((400,"3S15M1D9M2S"))
array([415])
>>> countDeletion((400,"3S15M1D9M2S"),expand=3)
array([412, 413, 414, 415, 416, 417, 418])

trxtools.sam.SAMgeneral.countMiddle(i=(), expand=0)

Takes tuple (position,CIGARstring) and returns list with the middle of mapped read

Parameters:

i (tuple) – tuple (first position of the read ,CIGAR string)
expand (int, optional) – number of nucleotides to expand for each side, defaults to 0

Returns:

list of mapped positions

Return type:

np.array

>>> countMiddle((400,"3S15M1D9M2S"))
array([412])
>>> countMiddle((400,"3S15M1D9M2S"),expand=3)
array([409, 410, 411, 412, 413, 414])

trxtools.sam.SAMgeneral.countRead(i=())

Takes tuple (position,CIGARstring) and returns list of mapped positions

Parameters:: i (tuple) – tuple (first position of the read ,CIGAR string)
Returns:: list of mapped positions
Return type:: np.array

>>> countRead((400,"3S15M1D9M2S"))
array([400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412,
   413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424])

trxtools.sam.SAMgeneral.groupCIGAR(cigar_string='')

Split CIGAR string to list of tuples

Parameters:: cigar_string (str)
Returns:: list of tuples [( ),()]
Return type:: list

>>> groupCIGAR("3S44M1S1H")
[('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]

trxtools.sam.SAMgeneral.noncoded2profile(df_input=Empty DataFrame Columns: [] Index: [], df_details=Empty DataFrame Columns: [] Index: [])

Turns non-coded ends into profile

Parameters:

df_input (DataFrame) – output of parseNoncoded function
df_details (DataFrame) – chromosome lengths

Returns:

DataFrame with profiles

Return type:

DataFrame

trxtools.sam.SAMgeneral.noncoded2profile1(df=Empty DataFrame Columns: [] Index: [], length=0)

Turns non-coded ends into profile

Parameters:

df_input (DataFrame) – output of parseNoncodedList function
df_details (DataFrame) – chromosome lengths

Returns:

Series with profiles

Return type:

Series

trxtools.sam.SAMgeneral.saveBigWig(paths={}, suffix='', bw_name='', chroms=[], pkl=True)

Save gzip pickle data to BigWig This function saves data from gzip pickle files to a BigWig file.

Parameters:

paths (dict) – Dictionary with paths to gzip pickle files, where keys are file paths and values are file contents.
suffix (str, optional) – Suffix to be removed from chromosome names in the paths, defaults to an empty string.
bw_name (str) – Name of the output BigWig file.
chroms (list of tuples) – List of tuples with chromosome names and their lengths.

Returns:

Success message indicating the suffix and that the file was saved successfully.

Return type:

str

Example:

>>> paths = {'/path/to/file_chr1.pkl.gz': 'data1', '/path/to/file_chr2.pkl.gz': 'data2'}
>>> suffix = '3end'
>>> bw_name = 'output.bw'
>>> chroms = [('chr1', 1000000), ('chr2', 2000000)]
>>> saveBigWig(paths, suffix, bw_name, chroms)
'3end saved successfully'

trxtools.sam.SAMgeneral.selectEnds(df=Empty DataFrame Columns: [] Index: [], ends='polyA')

Wrapper for functions selecting non-coded ends

Parameters:

df (DataFrame) – output of parseNoncoded or parseNoncodedList
ends (str) – type of ends, currently only “polyA” is availible, defaults to “polyA”

Returns:

runs selectPolyA

Return type:

DataFrame

trxtools.sam.SAMgeneral.selectNoncodedAndProfile(l=[], minLen=3, tail='AAA', letter='A', content=0.8)

Select sequences with non-coded ends and profile their occurrence.

Parameters:

l (list) – list of tuples where each tuple contains a position of 3’ end of read match and a sequence string, defaults to []
minLen (int, optional) – minimum length of the sequence to be considered, defaults to 3
tail (str, optional) – the non-coded end sequence to look for, defaults to “AAA”
letter (str, optional) – the letter to check the content of in the sequence, defaults to “A”
content (float, optional) – the minimum content threshold of the letter in the sequence, defaults to >=0.80

Returns:

a pandas Series profiling the occurrence of selected sequences

Return type:

pd.Series

Example:

>>> l = [(1, "AAA"), (2, "AACAA")]
>>> selectNoncodedAndProfile(l, minLen=3,tail="AAA",letter="A",content=0.80)
1    1

trxtools.sam.SAMgeneral.selectSortPaths(paths={}, chroms=[], suffix='')

Select and sort paths based on chromosome names and suffix This function filters and sorts the paths dictionary based on the provided chromosome names and suffix.

Parameters:

paths (dict) – Dictionary with paths to gzip pickle files, where keys are file paths and values are file contents.
chroms (list of tuples) – List of tuples with chromosome names and their lengths.
suffix (str, optional) – Suffix to be matched in the file paths, defaults to an empty string.

Returns:

Dictionary with filtered and sorted paths.

Return type:

dict

Example:

>>> paths = {'/path/to/file_chr1_3end.pkl.gz': 'data1', '/path/to/file_chr2_3end.pkl.gz': 'data2'}
>>> chroms = [('chr1', 1000000), ('chr2', 2000000)]
>>> suffix = '3end'
>>> selectSortPaths(paths, chroms, suffix)
{'/path/to/file_chr1_3end.pkl.gz': 'data1', '/path/to/file_chr2_3end.pkl.gz': 'data2'}

trxtools.sam.SAMgeneral.stripCIGAR(match=[], to_strip='H')

Removes H from output of groupCIGAR

Parameters:

match (list) – output of groupCIGAR, defaults to []
to_strip (str, optional) – CIGAR mark to be stripped, defaults to “H”

Returns:

modified list of tuples

Return type:

list

trxtools.sam.SAMgeneral.stripSubstitutions(match)

Strip substutiotns on both ends

Parameters:: match (list) – output of groupCIGAR, defaults to []
Returns:: list of tuples
Return type:: list

trxtools.sam.SAMgeneral.tostripCIGARfive(match=[])

Calculates length of soft-clipped nucleotides at the 5’ end of the read

Parameters:: match (list) – output of groupCIGAR, defaults to []
Returns:: number of substituted nucleotides
Return type:: int

>>> tostripCIGARfive([('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')])
3

trxtools.sam.SAMgeneral.tostripCIGARthree(match=[])

Nucleotides without alignment at the 3’ end of the read

Parameters:: match (list) – output of groupCIGAR, defaults to []
Returns:: number of substituted nucleotides
Return type:: int

>>> tostripCIGARthree([('3', 'S'), ('44', 'M'), ('1', 'S')])
1

trxtools.sam.SAMgenome submodule

trxtools.sam.SAMgenome.chromosome2profile3end(l=[], length=0, strand='FWD')

Generates profile for the 3’ ends of reads and saves position of non-coded end

Parameters:

l (list) – list of triple tuples (position, cigar_string, sequence), defaults to []
length (int) – length of chromosome
strand (str, optional) – 'FWD' or 'REV', defaults to ‘FWD’

Returns:

profile, noncoded

Return type:

np.array, list of tuples

>>> chromosome2profile3end(l=[(10,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="FWD")
(0     0.0
1     0.0
...
34    0.0
35    1.0
36    0.0
...
50    0.0
dtype: float64,
[(35, 'CC')])
>>> chromosome2profile3end(l=[(40,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="REV")
(0     0.0
1     0.0
...
39    0.0
40    1.0
41    0.0
...
50    0.0
dtype: float64,
[(40, 'AAA')])

trxtools.sam.SAMgenome.parseHeader(filename, name, dirPath)

Parses the header of a SAM file to extract chromosome names and lengths.

Parameters:

filename (str) – Path to the SAM file
name (str) – Name of the experiment
dirPath (str) – Directory path to save intermediate files

Returns:

DataFrame with chromosome names and lengths

Return type:

pd.DataFrame

Example:

>>> parseHeader("example.sam", "experiment1", "/path/to/dir")
chrName  length
chr1     248956422
chr2     242193529
...

trxtools.sam.SAMgenome.reads2genome(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='read', logName='')

Function used by sam2genome. Works for both strands.

Parameters:

name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes

Returns:

output_df_fwd, output_df_rev, log

Return type:

DataFrame, DataFrame, list

trxtools.sam.SAMgenome.reads2genome3end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='3end', noncoded=True, ends='polyA', logName='', minLen=3)

Function used by sam2genome3end. Works for both strands.

Parameters:

name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes
noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True

Returns:

output_df_fwd, output_df_rev, log, noncoded_fwd, noncoded_rev

Return type:

DataFrame, DataFrame, list, DataFrame, DataFrame

trxtools.sam.SAMgenome.reads2genome5end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='5end', logName='')

Function used by sam2genome5end. Works for both strands.

Parameters:

name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes

Returns:

output_df_fwd, output_df_rev, log

Return type:

DataFrame, DataFrame, list

trxtools.sam.SAMgenome.reads2genomeDeletions(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='del', expand=0, logName='')

Function used by sam2genome. Works for both strands.

Parameters:

name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes

Returns:

output_df_fwd, output_df_rev, log

Return type:

DataFrame, DataFrame, list

trxtools.sam.SAMgenome.sam2genome(filename='', path='', toClear='', chunks=0, use='3end', expand=0, noncoded=True, noncoded_suffix='polyA')

Function handling SAM files and generating profiles. Executed using wrapping script SAM2profilesGenomic.py.

Parameters:

filename (str)
path (str)
toClear (str, optional) – element of filename to be removed, defaults to ‘’
chunks (int, optional) – Read SAM file in chunks, defaults to 0
noncoded_pA (bool, optional) – Save non-coded polyA ends, defaults to True
noncoded_raw (bool, optional) – Save all non-coded ends, defaults to False

trxtools.sam.SAMtranscripts submodule

trxtools.sam.SAMtranscripts.reads2profile(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])

Takes a list of aligned reads and transforms them into a profile for each transcript. Tested only for PLUS strand.

Parameters:

name (str) – The name of the file (without extension) containing the aligned reads.
dirPath (str) – The directory path where the file is located.
df_details (pandas.DataFrame) – A DataFrame containing details about the transcripts, including their lengths.

Returns:

A tuple containing the output DataFrame with profiles and a log list with status messages.

Return type:

tuple (pandas.DataFrame, list)

Example:

>>> output_df, log = reads2profile(name='aligned_reads', dirPath='/path/to/files', df_details=df_details)

trxtools.sam.SAMtranscripts.reads2profileDeletions(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], expand=5)

Takes a list of aligned reads and transforms them into profiles for each transcript, including deletions. Tested only for PLUS strand.

Parameters:

name (str) – The name of the file (without extension) containing the aligned reads.
dirPath (str) – The directory path where the file is located.
df_details (pandas.DataFrame) – A DataFrame containing details about the transcripts, including their lengths.
expand (int) – The number of bases to expand around deletions.

Returns:

A tuple containing the output DataFrames with profiles (reads, deletions, deletions with expansion) and a log list with status messages.

Return type:

tuple (pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, list)

Example:

>>> output_df, outputDel_df, outputDelExt_df, log = reads2profileDeletions(name='aligned_reads', dirPath='/path/to/files', df_details=df_details, expand=5)

trxtools.sam.SAMtranscripts.sam2profiles(filename='', path='', geneList=[], toClear='', df_details=Empty DataFrame Columns: [] Index: [], deletions=False, expand=5, pickle=False, chunks=0)

Function handling SAM files and generating profiles. Executed using wrapping script SAM2profiles.py.

Parameters:

filename (str) – The name of the SAM file to be processed.
path (str) – The directory path where the SAM file is located.
geneList (list) – List of transcripts to be selected.
toClear (str, optional) – Element of filename to be removed, defaults to ‘’.
df_details (pandas.DataFrame) – DataFrame containing details of transcripts.
deletions (bool, optional) – Whether to generate profile of deletions, defaults to False.
expand (int, optional) – Number of bases to expand around deletions, defaults to 5.
pickle (bool, optional) – Whether to save output in pickle format, defaults to False.
chunks (int, optional) – Number of chunks to read SAM file in, defaults to 0.

Returns:

None

Example:

>>> sam2profiles(filename="example.sam", path="/path/to/files", geneList=["gene1", "gene2"], df_details=df_details)

trxtools.sam.SAMtranscripts.transcript2profile(l=[], length=0)

Takes a list of tuples (position, CIGARstring) and generates a profile. Works with entire reads, for both strands.

Parameters:

l (list) – List of tuples where each tuple contains a position and a CIGAR string.
length (int) – The length of the profile to be generated.

Returns:

A pandas Series representing the profile, with positions as the index and counts as the values.

Return type:

pandas.Series

Example:

>>> l = [(1, '10M'), (2, '5M')]
>>> length = 10
>>> profile = transcript2profile(l, length)
>>> print(profile)
1    1.0
2    2.0
3    2.0
4    2.0
5    2.0
6    2.0
7    1.0
8    1.0
9    1.0
10   1.0
dtype: float64

trxtools.sam.SAMtranscripts.transcript2profileDeletions(l=[], expand=0, length=0)

Takes a list of tuples (position, CIGARstring) and generates a profile for deletions.

Parameters:

l (list) – List of tuples where each tuple contains a position and a CIGAR string.
expand (int) – The number of bases to expand around deletions.
length (int) – The length of the profile to be generated.

Returns:

A pandas Series representing the profile, with positions as the index and counts as the values.

Return type:

pandas.Series

Example:

>>> profile = transcript2profileDeletions(l, expand, length)

trxtools.BigWig module

trxtools.BigWig.foldingFromBigWig(gene_name, data_path, data_files, gtf, ranges=0, range5end=0, offset=15)

Extracts folding energy for a given gene from BigWig files, and applies a folding offset to the data.

Parameters:

gene_name (str) – Name of the gene to extract data for.
data_path (str) – Path to the directory containing BigWig files.
data_files (list) – List of BigWig file names to extract data from.
gtf (GTF) – GTF file object containing gene annotations.
ranges (int, optional) – Range of nucleotides to extract, defaults to 0.
range5end (int, optional) – Number of bases to exclude from the 5’ end, defaults to 0.
offset (int, optional) – Number of bases to shift the data by, defaults to 15.

Returns:

DataFrame containing nucleotide sequences and associated data with applied folding offset.

Return type:

pd.DataFrame

Example:

>>> gtf = GTF("/path/to/annotations.gtf")
>>> df = foldingFromBigWig("BRCA1", "/path/to/bigwig/", ["file1.bw", "file2.bw"], gtf, ranges=1000, range5end=100, offset=15)
>>> print(df.head())

trxtools.BigWig.geneFromBigWig(gene_name, data_path, data_files, gtf, ranges=0, verbose=False)

Extracts nucleotide sequences and associated data for a given gene from BigWig files.

Parameters:

gene_name (str) – Name of the gene to extract data for.
data_path (str) – Path to the directory containing BigWig files.
data_files (list) – List of BigWig file names to extract data from.
gtf (GTF) – GTF file object containing gene annotations.
ranges – Range of nucleotides to extract, defaults to 0.
verbose – If True, prints the name of each BigWig file being processed, defaults to False.

Returns:

DataFrame containing nucleotide sequences and associated data.

Return type:

pd.DataFrame

Example:

>>> gtf = GTF("/path/to/annotations.gtf")
>>> df = geneFromBigWig("BRCA1", "/path/to/bigwig/", ["file1.bw", "file2.bw"], gtf, ranges=1000, verbose=True)
>>> print(df.head())

trxtools.BigWig.getSeqData(gene_name, data_path, name, gtf, ranges=0)

Load BigWig data for a gene/chromosome region and return as pandas Series.

Parameters:

gene_name (str) – Name of the gene/chromosome to retrieve data for.
data_path (str) – Path to the directory containing BigWig files.
name (str) – Base name of the BigWig files (without strand suffix and extension). Can be obtained using strip_BigWig_names().
gtf (object) – GTF object that provides gene information such as strand, chromosome, and coordinates.
ranges – Number of bases to extend the region on both sides, defaults to 0.

Returns:

BigWig data for the specified gene or region as a pandas Series.

Return type:

pandas.Series

Example:

>>> gtf = GTF('/path/to/gtf_file.gtf')
>>> data = getSeqData('BRCA1', '/path/to/bigwig/', 'sample', gtf, ranges=100)
>>> print(data)

trxtools.BigWig.strip_BigWig_names(files=[])

Remove _fwd.bw and _rev.bw from the list of BigWig files

Parameters:: files (list, required) – list of filenames, defaults to list()
Returns:: list of unique names
Return type:: list

>>> strip_BigWig_names(["file1_fwd.bw", "file1_rev.bw", "file2_fwd.bw", "file2_rev.bw"])
['file2', 'file1']

trxtools.metaprofiles module

trxtools.metaprofiles.bed_split_strands(bed_df)

Splits a BED dataframe into two separate dataframes based on strand information.

Parameters:: bed_df (pandas.DataFrame) – A dataframe containing BED format data. The strand information is expected to be in the 6th column (index 5).
Returns:: A tuple containing two dataframes: - bed_plus (pandas.DataFrame): Dataframe containing entries with the ‘+’ strand. - bed_minus (pandas.DataFrame): Dataframe containing entries with the ‘-’ strand.
Return type:: tuple

trxtools.metaprofiles.binMultipleMatrices(mm={}, bins=[50, 10, 50], bed_df=Empty DataFrame Columns: [] Index: [], flank_5=None, flank_3=None)

Bin multiple matrices of tRNA profiles into a single dataframe

Parameters:

mm (dict) – dictionary of matrices of gene profiles
bins (list, optional) – list of three integers, defaults to [50, 10, 50]
bed_df (pandas.DataFrame, optional) – dataframe in BED format containing genomic coordinates of target regions, defaults to pd.DataFrame()
flank_5 (int, optional) – length of 5’flank to extend BED regions by, defaults to None
flank_3 (int, optional) – length of 3’flank to extend BED regions by, defaults to None

Raises:

ValueError – if bins is not a list of three integers
ValueError – if input_value is not an integer or a DataFrame

Returns:

dictionary of binned matrices of tRNA profiles

Return type:

dict

trxtools.metaprofiles.getMultipleMatrices(bw_paths_plus, bw_paths_minus, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, normalize_libsize=True, align_3end=False)

Get score matrices for positions in given regions (with optional flanks) from multiple BigWig files. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks.

Parameters:

bw_paths_plus (list) – list of paths to BigWig files (+ strand)
bw_paths_minus (list) – list of paths to BigWig files (- strand)
bed_df (pandas.DataFrame) – dataframe in BED format containing genomic coordinates of target regions
flank_5 (int, optional) – number of nt that input regions will be extended by on 5’ side, defaults to 0
flank_3 (int, optional) – number of nt that input regions will be extended by on 3’ side, defaults to 0
fill_na (bool, optional) – _description_, defaults to True
pseudocounts (float, optional) – pseudocounts to add to datapoints, defaults to None
normalize_libsize (bool, optional) – normalization to library size (sum of scores in a bigwig file), defaults to True
align_3end (bool, optional) – if True, align profiles at the 3’end of features. Otherwise align at 5’end, defaults to False

Returns:

A dictionary containing score matrices for individual BigWig files. Dictionary keys are BigWig file names.

Return type:

dict

trxtools.metaprofiles.getMultipleMatricesFromPeak(peak_paths=[], bed_df=<class 'pandas.core.frame.DataFrame'>, g='references/GRCh38.primary_assembly.genome.cleaned.len', flank_5=0, flank_3=0): Get score matrices for positions in given regions (with optional flanks) from multiple peak files. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. :param peak_paths: list of paths to peak files (broadPeak or narrowPeak format) :type peak_paths: list :param bed_df: dataframe in BED format containing genomic coordinates of target regions :type bed_df: pandas.DataFrame :param g: Path to genome file (chromosome lengths), defaults to ‘references/GRCh38.primary_assembly.genome.cleaned.len’ :type g: str :param flank_5: length of 5’flank to extend BED regions by, defaults to 0 :type flank_5: int, optional :param flank_3: length of 3’flank to extend BED regions by, defaults to 0 :type flank_3: int, optional :return: dictionary of score matrices for each peak file :rtype: dict

trxtools.metaprofiles.get_bw_data(bed_row, bw, flank_5=0, flank_3=0, align_3end=False)

Retrieve BigWig scores for positions in a given region, optionally including flanks of given length.

Parameters:

bed_row (list or pandas.Series) – A row from a BED file, containing chromosome, start, end, and strand information.
bw (pyBigWig) – A pyBigWig object to retrieve values from.
flank_5 (int, optional) – Length of the 5’ flank to include, defaults to 0.
flank_3 (int, optional) – Length of the 3’ flank to include, defaults to 0.
align_3end (bool, optional) – Whether to align the series to the 3’ end, defaults to False.

Returns:

A pandas Series containing the BigWig scores for the specified region.

Return type:

pandas.Series

trxtools.metaprofiles.get_multiple_matrices(bw_paths_plus, bw_paths_minus, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, normalize_libsize=True, align_3end=False)

WARNINIG: This function is replaced by getMultipleMatrices()

Get score matrices for positions in given regions (with optional flanks) from multiple BigWig files. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks.

Parameters:

bw_paths_plus (list) – list of paths to BigWig files (+ strand)
bw_paths_minus (list) – list of paths to BigWig files (- strand)
bed_df (pandas.DataFrame) – dataframe in BED format containing genomic coordinates of target regions
flank_5 (int, optional) – number of nt that input regions will be extended by on 5’ side, defaults to 0
flank_3 (int, optional) – number of nt that input regions will be extended by on 3’ side, defaults to 0
fill_na (bool, optional) – _description_, defaults to True
pseudocounts (float, optional) – pseudocounts to add to datapoints, defaults to None
normalize_libsize (bool, optional) – normalization to library size (sum of scores in a bigwig file), defaults to True
align_3end (bool, optional) – if True, align profiles at the 3’end of features. Otherwise align at 5’end, defaults to False

Returns:

A dictionary containing score matrices for individual BigWig files. Dictionary keys are BigWig file names.

Return type:

dict

Warning

This function is deprecated and will be removed in future versions.

trxtools.metaprofiles.join_strand_matrices(plus_dict, minus_dict)

Combine score matrices for regions on + and - strands.

Parameters:

plus_dict (dict) – Dictionary containing score matrices for regions on the + strand.
minus_dict (dict) – Dictionary containing score matrices for regions on the - strand.

Raises:

Exception – If keys in dictionaries do not match.

Returns:

Dictionary containing combined score matrices for regions on both strands.

Return type:

dict

trxtools.metaprofiles.matrixFromBigWig(bw_path, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, align_3end=False)

Get matrix with BigWig scores for all regions in a bed df from a single BigWig file. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. Flanks are strand-aware.

Parameters:

bw_path (str) – Path to target BigWig file
bed_df (pandas.DataFrame) – DataFrame with BED data (use read_bed())
flank_5 (int, optional) – length of 5’flank to extend BED regions by, defaults to 0
flank_3 (int, optional) – length of 3’flank to extend BED regions by, defaults to 0
fill_na (bool, optional) – If true replace NaN values with 0 (pybigwig returns positions with 0 coverage as NaN), defaults to True
pseudocounts (float, optional) – pseudocounts to add to retrieved scores, defaults to None
align_3end (bool, optional) – If true, position 0 in the resulting matrix will be set at the target region’s 3’end instead of 5’end, defaults to False

Returns:

DataFrame with the result score matrix

Return type:

pandas.DataFrame

trxtools.metaprofiles.matrix_from_bigwig(bw_path, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, align_3end=False)

WARNINIG: This function is replaced by matrixFromBigWig()

Get matrix with BigWig scores for all regions in a bed df from a single BigWig file. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. Flanks are strand-aware.

Parameters:

bw_path (str) – Path to target BigWig file
bed_df (pandas.DataFrame) – DataFrame with BED data (use read_bed())
flank_5 (int, optional) – length of 5’flank to extend BED regions by, defaults to 0
flank_3 (int, optional) – length of 3’flank to extend BED regions by, defaults to 0
fill_na (bool, optional) – If true replace NaN values with 0 (pybigwig returns positions with 0 coverage as NaN), defaults to True
pseudocounts (float, optional) – pseudocounts to add to retrieved scores, defaults to None
align_3end (bool, optional) – If true, position 0 in the resulting matrix will be set at the target region’s 3’end instead of 5’end, defaults to False

Returns:

DataFrame with the result score matrix

Return type:

pandas.DataFrame

Warning

This function is deprecated and will be removed in future versions.

trxtools.metaprofiles.metaprofile(matrix_dict, agg_type='mean', normalize_internal=False, subset=None)

Calculate metaprofiles from score matrices by aggregating each position in all regions.

Parameters:

matrix_dict (dict) – Dict containing score matrices returned by get_multiple_matrices()
agg_type (str, optional) – Type of aggregation to use. Available: ‘mean’, ‘median’, ‘sum’, defaults to ‘mean’
normalize (bool, optional) – if true, normalize each profile internally (i.e. gene-wise) (x/sum(x)) before aggregating, defaults to False

Raises:

Exception – _description_

Returns:

dataframe containing metaprofile values for each position in each of the input matrices (i.e. bigwig files)

Return type:

pandas.DataFrame

trxtools.metaprofiles.peak2matrice(bed_df=<class 'pandas.core.frame.DataFrame'>, peak_file_path='', g='references/GRCh38.primary_assembly.genome.cleaned.len', flank_5=0, flank_3=0, fillna=True)

Get score matrices for positions in given regions (with optional flanks) from a single nroadPeak or narrowPeak file. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. Flanks are strand-aware.

Parameters:

bed_df (pandas.DataFrame) – DataFrame with BED data (can use read_bed())
peak_file_path (str) – Path to peak file (broadPeak or narrowPeak format)
g (str) – Path to genome file (chromosome lengths), defaults to ‘references/GRCh38.primary_assembly.genome.cleaned.len’
flank_5 (int, optional) – length of 5’flank to extend BED regions by, strand-aware, defaults to 0
flank_3 (int, optional) – length of 3’flank to extend BED regions by, strand-aware, defaults to 0
fill_na (bool, optional) – If true replace NaN values with 0 (pybigwig returns positions with 0 coverage as NaN), defaults to True

Returns:

DataFrame with the result score matrix

Return type:

pandas.DataFrame

trxtools.metaprofiles.readBED(bed_path)

Simple BED file parser

Parameters:: bed_path (str) – Path to BED file
Returns:: Contents of BED file in DataFrame form
Return type:: pandas.DataFrame

trxtools.metaprofiles.read_bed(bed_path)

Simple BED file parser

Parameters:: bed_path (str) – Path to BED file
Returns:: Contents of BED file in DataFrame form
Return type:: pandas.DataFrame

Warning

This function is deprecated and will be removed in future versions.

trxtools.metaprofiles.regionScore(bw_paths_plus, bw_paths_minus, bed_df, agg_type='sum', flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, normalize_libsize=True)

Calculate coverage or statistic for multiple regions in multiple BigWig files.

Parameters:

bw_paths_plus (list) – list of paths to BigWig files (+ strand)
bw_paths_minus (list) – list of paths to BigWig files (- strand)
bed_df (pandas.DataFrame) – dataframe in BED format containing genomic coordinates of target regions
agg_type (str, optional) – operation to perform on region scores. Available options: ‘sum’, ‘mean’, ‘median’, defaults to ‘sum’
flank_5 (int, optional) – number of nt that input regions will be extended by on 5’ side, defaults to 0
flank_3 (int, optional) – number of nt that input regions will be extended by on 3’ side, defaults to 0
fill_na (bool, optional) – If true, NaNs will be replaced with zeroes (recommended, as pyBigWig reports positions with 0 score as NaN), defaults to True
pseudocounts (float, optional) – pseudocounts to add to datapoints, defaults to None
normalize_libsize (bool, optional) – normalization to library size (sum of scores in a bigwig file), defaults to True

Returns:

DataFrame with calculated scores. Rows are regions/genes, columns are BigWig files.

Return type:

pandas.DataFrame

trxtools.metaprofiles.selectSubsetMM(matrix_dict, subset=None)

Select a subset of regions from the matrix dictionary.

Parameters:

matrix_dict (dict) – Dictionary containing score matrices.
subset (Union[pd.DataFrame, list], optional) – DataFrame or list containing the subset of regions to select, defaults to None

Returns:

Dictionary containing the selected subset of score matrices.

Return type:

dict

trxtools.assays module

trxtools.assays.bulkInputIDT(data=Empty DataFrame Columns: [] Index: [])

Transforms dataframe with columns “template” and “non-template” and prepares table to be used as bulk input. Discards oligos longer than 200 nt.

Parameters:: data (DataFrame) – Table with columns “template” and “non-template”
Returns:: Table with sequences to order
Return type:: DataFrame

trxtools.assays.extruded(seq='', buried=13)

Returns the extruded sequence by removing the buried nucleotides from the end

Parameters:

seq (str) – sequence
buried (int) – length of sequence buried within RNAP

Returns:

extruded sequence

Return type:

str

trxtools.assays.findPrimer(seq='', query='')

Find query sequence within given sequence

Parameters:

seq (str) – sequence
query (str) – query within searched sequences

Returns:

(start,stop) in python indexing

Return type:

tuple

Example:

>>> findPrimer("ATCGATCG", "ATCG")
(0, 4)

trxtools.assays.nonTemplateDNA(seq='', primer='')

Returns the DNA sequence of the non-template strand.

Parameters:

seq (str) – sequence
primer (str) – primer sequence

Returns:

DNA sequence of non-template strand

Return type:

str

trxtools.assays.sequenceConstrain(structure='', stall='', RNAprimer='')

Returns sequence constrained with the 5prime RNA primer

Parameters:

structure (str) – sequence structure
stall (str) – single letter representing the stall sequence
RNAprimer (str) – RNA primer sequence

Returns:

sequence constraints for folding algorithm

Return type:

str

trxtools.assays.stalled(seq='', stall='AAA', primer='')

Finds and returns stalled sequence

Parameters:

seq (str) – sequence
stall (str) – stall sequence, default “AAA”
primer (str) – primer sequence

Returns:

stalled sequence

Return type:

str

trxtools.assays.structureFile(structure='', stall='', RNAprimer='')

Saves structure file in current directory

Parameters:

structure (str) – sequence structure
stall (str) – single letter representing the stall sequence
RNAprimer (str) – RNA primer sequence

Returns:

True

Return type:

bool

trxtools.assays.templateDNA(seq='', overhang5end='')

Returns the DNA sequence of the template strand. Takes reverse complement of RNA sequence and the 5’end DNA overhang as input.

Parameters:

seq (str) – sequence of RNA
overhang5end (str) – sequence of the 5’end DNA overhang

Returns:

DNA sequence of template strand

Return type:

str

trxtools.assays.testScaffold(data=Empty DataFrame Columns: [] Index: [], overhang5end='', RNA_primer='')

Prints scaffold to test before ordering

Parameters:

data (DataFrame) – DataFrame with columns “template” and “non-template”
overhang5end (str) – sequence of the 5’end DNA overhang
RNA_primer (str) – RNA primer sequence

Returns:

None

trxtools.assays.toOrder(data=Empty DataFrame Columns: [] Index: [], buried='GUCUGUUUUGUGG', stallFull='AAA', afterStall='TGATCGGTAC', overhang5end='TGA', RNA_primer='AGGCCGAAA', bulkInput=True, test=True, lengthMax=200)

Prepares sequences for ordering by applying folding functions to DNA sequences

Parameters:

data (DataFrame) – DataFrame with columns “sequence” and “name”
buried (str) – sequence buried within RNAP, default “GUCUGUUUUGUGG”
stallFull (str) – stall sequence, default “AAA”
afterStall (str) – sequence after stall, default “TGATCGGTAC”
overhang5end (str) – sequence of the 5’end DNA overhang, default “TGA”
RNA_primer (str) – RNA primer sequence, default “AGGCCGAAA”
bulkInput (bool) – whether to prepare bulk input for ordering, default True
test (bool) – whether to print scaffold for testing before ordering, default True
lengthMax (int) – maximum length of sequences to order, default 200

Returns:

DataFrame of sequences to order

Return type:

DataFrame

trxtools.go_enrichment module

trxtools.go_enrichment.get_enrichment(query_genes, organism, go_dataset='biological_process', use_reference_set=False, ref_genes=None, ref_organism=None, test_type='FISHER', correction='FDR')

Run a GO term enrichment test using PANTHER API

Parameters:

query_genes (list) – List of sequence identifiers of queried genes (e.g. transcript ids, gene ids)
organism (str) – Taxid of query species (e.g. “9606” for H. sapiens)
go_dataset (str) – Which annotation dataset to query, “biological_process” or “molecular_function”
use_reference_set (bool) – Use a custom set of rerence (background) genes? Default False. If True, ref_genes and ref_species need to be specifed.
ref_genes (list) – optional list of reference genes. Specifying None (default) will use the whole genome of species specified in organism. When passing a list, ref_organism taxid must also be provided.
ref_species (str) – Taxid of reference species, required when ref_genes is not None
test_type (str) – Which tatistical test to use. Available: “FISHER” (default), “BINOMIAL”
correction (str) – Which multiple testing correction method to use. Available: “FDR” (default), “BONFERRONI”, “NONE”

Returns:

Unfiltered DataFrame of results.

Return type:

pandas.DataFrame:

trxtools.methods module

trxtools.methods.DNA_stretch_positions(df, char, min_len, max_len, name_col='name', seq_col='sequence')

Finds all character stretches of given length in a string (i.e. DNA sequence) and reports their positions. Wrapper for DNA_string_positions()

Parameters:

df (pandas.DataFrame) – Input dataframe
char (str) – Query character (base)
min_len (int) – minimum length of stretch to look for
min_len – maximum length of stretch to look for
name_col (str, optional) – name of column containing sequence names, defaults to ‘name’
seq_col (str, optional) – name of column containing sequences, defaults to ‘sequence’

Returns:

Dataframe in long format containing positions of found string

Return type:

pandas.DataFrame

trxtools.methods.DNA_string_positions(df, string, name_col='name', seq_col='sequence')

Finds all occurences of a given substring in a string (i.e. DNA sequence) and reports their positions

Parameters:

df (pandas.DataFrame) – Input dataframe
string (str) – Query string
name_col (str, optional) – name of column containing sequence names, defaults to ‘name’
seq_col (str, optional) – name of column containing sequences, defaults to ‘sequence’

Returns:

Dataframe in long format containing positionsof found string

Return type:

pandas.DataFrame

trxtools.methods.addCluster(df=Empty DataFrame Columns: [] Index: [], n=10)

Assigns n clusters to the data using KMeans algorithm

Parameters:

df (pandas.DataFrame) – DataFrame containing the data to be clustered
n (int, optional) – Number of clusters to form, defaults to 10

Returns:

DataFrame with an additional ‘cluster’ column indicating cluster assignment

Return type:

pandas.DataFrame

Example:

>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [5, 4, 3, 2, 1]})
>>> addCluster(df, n=2)
   x  y  cluster
0  1  5        1
1  2  4        1
2  3  3        0
3  4  2        0
4  5  1        0

trxtools.methods.bashCommand(bashCommand='')

Run command in bash using subprocess.call()

Parameters:: bashCommand – str with command
Returns:: None

trxtools.methods.bed2len(bed=Empty DataFrame Columns: [] Index: [])

Convert a bed DataFrame to a length DataFrame.

Parameters:: bed (pandas.DataFrame) – pandas DataFrame containing bed data.
Returns:: pandas DataFrame with length data.
Return type:: pandas.DataFrame
Example:

>>> bed = pd.DataFrame({'chr': ['chr1', 'chr1'], 'start': [100, 200], 'stop': [150, 250], 'region': ['region1', 'region2']})
>>> bed2len(bed)
region
region1    50
region2    50
dtype: int64

trxtools.methods.calGC(dataset=Empty DataFrame Columns: [] Index: [], calFor=['G', 'C'])

Returns GC content in a given string - uses [‘nucleotide’] column

Parameters:

dataset (pandas.DataFrame) – DataFrame() with “nucleotide” column
calFor (list) – list of nucleotides to calculate GC content

Returns:

fraction of GC content, float

Return type:

float

trxtools.methods.cleanNames(data, strings=[])

Cleans the names in the given data by removing specified strings.

Parameters:

data (dict or pandas.DataFrame) – The data to be cleaned. It can be either a dictionary or a pandas DataFrame.
strings (list, optional) – A list of strings to be removed from the names. Defaults to an empty list.

Returns:

The cleaned data with names modified according to the specified strings.

Return type:

dict or pandas.DataFrame

trxtools.methods.define_experiments(paths_in, whole_name=False, strip='_hittable_reads.txt')

Parse file names and extract experiment name from them

Parameters:

paths_in (list) – List of file paths
whole_name (bool, optional) – Whether to use the whole file name as the experiment name, defaults to False
strip (str, optional) – String to strip from the file name, defaults to ‘_hittable_reads.txt’

Returns:

List of experiment names and list of paths

Return type:

tuple

trxtools.methods.enriched(df, padj=0.05, fc=2)

Filter DataFrame for enriched genes based on adjusted p-value and log2 fold change.

Parameters:

df (pandas.DataFrame) – DataFrame containing gene expression data
padj (float, optional) – Adjusted p-value threshold, defaults to 0.05
fc (float, optional) – Log2 fold change threshold, defaults to 2

Returns:

Filtered DataFrame with enriched genes

Return type:

pandas.DataFrame

trxtools.methods.expNameParser(name, additional_tags=[], order='b_d_e_p')

Function handles experiment name; recognizes AB123456 as experiment date; BY4741 or HTP or given string as bait protein

Parameters:

name (str) – Experiment name
additional_tags (list) – List of additional tags
order (str, optional) – Order of elements in the output, defaults to ‘b_d_e_p’

Returns:

Reordered experiment name

Return type:

str

trxtools.methods.expStats(input_df=Empty DataFrame Columns: [] Index: [], smooth=True, window=10, win_type='blackman')

Returns DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments

Parameters:

input_df (pandas.DataFrame) – DataFrame containing the input data
smooth (bool, optional) – Whether to apply smoothing window, defaults to True
window (int, optional) – Smoothing window size, defaults to 10
win_type (str, optional) – Type of smoothing window, defaults to ‘blackman’

Returns:

DataFrame with calculated statistics

Return type:

pandas.DataFrame

trxtools.methods.filterExp(datasets, let_in=[''], let_out=['wont_find_this_string'], verbose=False)

Returns object with filtered columns/keys.

Parameters:

datasets – DataFrame or dict with exp name as a key
let_in (list) – List of elements of name to filter in
let_out (list) – List of elements of name to filter out
verbose (bool, optional) – If True, prints the filtered columns/keys, defaults to False

Returns:

DataFrame or dict with filtered columns/keys

Return type:

pandas.DataFrame or dict

trxtools.methods.find_pol3_terminators(df, min_T, max_T, name_col='name', seq_col='sequence')

Finds all Pol3 terminators (T stretches) of given length in a string (i.e. DNA sequence) and reports their positions. Wrapper for DNA_string_positions()

Parameters:

df (pandas.DataFrame) – Input dataframe
min_T (int) – minimum length of T stretch to look for
min_T – maximum length of T stretch to look for
region_col (str, optional) – name of column containing sequence names, defaults to ‘region’
seq_col (str, optional) – name of column containing sequences, defaults to ‘sequence’

Returns:

Dataframe in long format containing positionsof found string

Return type:

pandas.DataFrame

trxtools.methods.groupCRACsamples(df=<class 'pandas.core.frame.DataFrame'>, use='protein', toDrop=[])

Parse CRAC names and annotates them using one of the following features: [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’, ‘sampleRep’]

Parameters:

df (pandas.DataFrame) – DataFrame with CRAC names as index
use (str, optional) – Feature to annotate, choose from [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’, ‘sampleRep’], defaults to ‘protein’
toDrop (list, optional) – List of words in CRAC name that will qualify the sample for rejection, defaults to []

Returns:

DataFrame with added column [‘group’]

Return type:

pandas.DataFrame

trxtools.methods.indexOrder(df=Empty DataFrame Columns: [] Index: [], additional_tags=[], output='root', order='b_d_e_p')

Apply expNameParser to whole DataFrame

Parameters:

df (pandas.DataFrame) – DataFrame where names of columns are names of experiments
additional_tags (list) – List of additional tags to consider in expNameParser
output (str) – Not used in the function, kept for compatibility
order (str, optional) – Order of elements in the output, defaults to ‘b_d_e_p’

Returns:

DataFrame with new names

Return type:

pandas.DataFrame

trxtools.methods.is_inside(inner_start, inner_end, outer_start, outer_end)

Check if one region is inside another string.

Parameters:

inner_start (int) – Start position of the inner region
inner_end (int) – End position of the inner region
outer_start (int) – Start position of the outer region
outer_end (int) – End position of the outer region

Returns:

True if the inner region is inside the outer region, False otherwise

Return type:

bool

Example:

>>> is_inside(1, 3, 0, 5)
True
>>> is_inside(1, 3, 0, 2)
False

trxtools.methods.letterContent(seq='', letter='A')

Calculates the content of a given letter in a sequence

Parameters:

seq (str) – Sequence to calculate the content of the given letter in
letter (str) – letter to calculate the content of in the sequence, defaults to “A”

Returns:

Content of the given letter in the sequence, round to two decimal places

Return type:

float

Example:

>>> letterContent("ATCG", "A")
0.25

trxtools.methods.listPaths(folder='.', suffix='', nameElem=None)

List all file paths in a folder with a specific suffix and optional name element.

Parameters:

folder (str, optional) – Folder to list files from, defaults to “.”
suffix (str, optional) – Suffix of the files to list, defaults to an empty string
nameElem (str, optional) – Optional element that must be in the file name, defaults to None

Returns:

List of file paths that match the criteria

Return type:

list

Example:

>>> listPaths(folder=".", suffix=".txt", nameElem="file")
['file1.txt', 'file2.txt']

trxtools.methods.loadGTF(gtf_path='')

Load GTF file into a DataFrame

Parameters:: gtf_path (str, optional) – Path to the GTF file, defaults to “”
Returns:: DataFrame containing GTF data
Return type:: pandas.DataFrame

trxtools.methods.nested_region_cleanup(df, start_col='start', end_col='end')

Helper function to remove shorter regions nested in longer ones

Parameters:

df (pandas.DataFrame) – input dataframe
start_col (str, optional) – name of column containing region start positions, defaults to ‘start’
end_col (str, optional) – name of column containing region end positions, defaults to ‘end’
length_col (str, optional) – name of column containing region lengths, defaults to ‘length’

Returns:

dataframe with nested regions removed

Return type:

pandas.DataFrame

trxtools.methods.normalize(df=<class 'pandas.core.frame.DataFrame'>, log2=False, pseudocounts=0.1, CPM=True)

Normalize the DataFrame by adding pseudocounts, dividing by the sum, and optionally applying log2 transformation and CPM scaling.

Parameters:

df (pandas.DataFrame) – DataFrame to be normalized
log2 (bool, optional) – Whether to apply log2 transformation, defaults to False
pseudocounts (float, optional) – Value to add to each element to avoid log of zero, defaults to 0.1
CPM (bool, optional) – Whether to scale the data to counts per million, defaults to True

Returns:

Normalized DataFrame

Return type:

pandas.DataFrame

trxtools.methods.parseCRACname(s1=<class 'pandas.core.series.Series'>)

Parse CRAC name into [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’] using this order.: “_” is used to split the name

Parameters:: s1 (pandas.Series) – Series containing CRAC names
Returns:: DataFrame with parsed CRAC names
Return type:: pandas.DataFrame

trxtools.methods.quantileCategory(s1=Series([], dtype: float64), q=4)

Quantile-based discretization function based on pandas.qcut function.

Parameters:

s1 (pandas.Series) – Series to be discretized
q (int, optional) – Number of quantiles, defaults to 4

Returns:

Series with quantile categories

Return type:

pandas.Series

Example:

>>> quantileCategory(pd.Series([1, 2, 3, 4, 5]), q=2)
     0
     0
     1
     1
     1
dtype: int64

trxtools.methods.randomDNAall(length=0, letters='CGTA')

Generates all possible random sequences of a given length

Parameters:

length (int) – Length of the sequence
letters (str) – str with letters that will be used

Returns:

List of all possible random sequences of the given length

Return type:

list

Example:

>>> randomDNAall(2,"CGTA")
['CC', 'CG', 'CT', 'CA', 'GC', 'GG', 'GT', 'GA', 'TC', 'TG', 'TT', 'TA', 'AC', 'AG', 'AT', 'AA']

trxtools.methods.randomDNAsingle(length=0, letters='CGTA')

Random generator of nucleotide sequence

Parameters:

length (int) – Length of the sequence
letters (str) – str with letters that will be used

Returns:

Random sequence of the given length

Return type:

str

Example:

>>> randomDNAsingle(10,"CGTA")
'CGTACGTAAC'

trxtools.methods.readSalmon(nameElem='', path='', toLoad='', toClear=[], toAdd='', column='NumReads', df=None, overwrite=False)

Reads multiple Salmon quant.sf files to one DataFrame

Parameters:

nameElem (str) – Element to load, present in all files
path (str) – Path to directory with files
toLoad (str, optional) – Additional parameter for filtering, defaults to nameElem
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
column (str, optional) – Column to extract from quant.sf file, defaults to ‘NumReads’
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False

Returns:

DataFrame with all files, where columns are values from ‘Name’ column

Return type:

pandas.DataFrame

trxtools.methods.read_DEseq(p)

WARNING: Not tested properly. May not work as expected. Read DESeq2 output file and add gene names.

Parameters:: p (str) – Path to the DESeq2 output file
Returns:: DataFrame with DESeq2 results and gene names
Return type:: pandas.DataFrame

trxtools.methods.read_HTSeq_output(path='', toLoad='classes', toClear=[], toAdd='', df=None, overwrite=False)

Reads multiple HTSeq tab files to one DataFrame

Parameters:

path (str) – Path to directory with files
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False

Returns:

DataFrame with all files, where columns are values from ‘name’ column

Return type:

pandas.DataFrame

trxtools.methods.read_STARstats(path='', toClear=[], toAdd='', df=None, overwrite=False)

Reads multiple STAR Log final output files to one DataFrame

Parameters:

path (str) – Path to directory with files
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False

Returns:

DataFrame with all files, where columns are values from ‘name’ column

Return type:

pandas.DataFrame

trxtools.methods.read_featureCount(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)

Read featureCount files with common first column

Parameters:

nameElem (str) – Element to load, present in all files
path (str) – Path to directory with files
toLoad (str, optional) – Additional parameter for filtering, defaults to nameElem
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False

Returns:

DataFrame with all files, where columns are values from ‘name’ column

Return type:

pandas.DataFrame

trxtools.methods.read_list(filepath='')

Read list from file. Each row becomes item in the list.

Parameters:: filepath (str) – Path to the file
Returns:: List of items from the file
Return type:: list
Example:

>>> read_list("file.txt")
['item1', 'item2']

trxtools.methods.read_tabFile(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)

Read tab files with common first column

Parameters:

nameElem (str) – part of a filename which is present in all files
path (str) – path to directory with files
toLoad (str) – to be present in file name
toClear (str) – part of filename to be removed from file name
toAdd (str) – string to be added to file name
df (pandas.DataFrame) – DataFrame, to be appended; default=None
overwrite (bool) – boolean, allows for overwriting during appending, default = False

Returns:

dataframe with all files, where columns are values from ‘name’ column

Return type:

pandas.DataFrame

trxtools.methods.reverse_complement(seq)

Reverse complement of DNA or RNA sequence. Identifies RNA by presence of ‘U’ in the sequence.

Parameters:: seq (str) – DNA or RNA sequence to reverse complement
Returns:: Reverse complement of the input sequence
Return type:: str
Example:

>>> reverse_complement("AUCG")
'CGAU'

trxtools.methods.reverse_complement_DNA(seq)

Reverse complement of DNA sequence

Parameters:: seq (str) – DNA sequence to reverse complement
Returns:: Reverse complement of the input sequence
Return type:: str
Example:

>>> reverse_complement_DNA("ATCG")
'CGAT'

trxtools.methods.reverse_complement_RNA(seq)

Reverse complement of RNA sequence

Parameters:: seq (str) – RNA sequence to reverse complement
Returns:: Reverse complement of the input sequence
Return type:: str
Example:

>>> reverse_complement_RNA("AUCG")
'CGAU'

trxtools.methods.rollingGC(s=<class 'pandas.core.series.Series'>, window=10)

Calculates GC from sequence, uses ‘boxcar’ window

Parameters:

s (pandas.Series) – Series containing sequence
window (int) – window size for GC calculation

Returns:

Series with GC calculated, center=False

Return type:

pandas.Series

trxtools.methods.runPCA(data=Empty DataFrame Columns: [] Index: [], n_components=2)

Run PCA analysis and re-assigns column names and index names

Parameters:

data (pandas.DataFrame) – DataFrame containing the input data
n_components (int, optional) – Number of principal components to compute, defaults to 2

Returns:

Tuple consisting of DataFrame with PCA results and a list of explained variance ratios

Return type:

tuple

Example:

>>> runPCA(pd.DataFrame([[1, 2], [3, 4], [5, 6]]), n_components=2)
(          PC1       PC2
0 -4.242641  0.000000
1  0.000000  0.000000
2  4.242641  0.000000, [100.0, 0.0])

trxtools.methods.timestamp()

Returns current timestamp as a string

Returns:: timestamp in a format ‘YYYYMMDD_HHMMSS’
Return type:: str

trxtools.methods.timestampRandomInt()

Returns current timestamp with a random integer as a string

Returns:: timestamp in a format ‘YYYYMMDD_HHMMSS_RANDOMINT’
Return type:: str

trxtools.nascent module

class trxtools.nascent.Fold(tempDir=None)

Bases: object

RNAfold(data, saveData=False, temp=None)

Calculates dG using RNAfold (ViennaRNA).

Parameters:

data ({list, pd.Series, pd.DataFrame}) – Input data, can be a list, Series, or DataFrame with “name” column.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for folding, default is None.

Returns:

DataFrame with calculated dG values.

Return type:

pd.DataFrame

RNAinvert(structure='', saveData=False, temp=None, n=5, RNAprimer='', stall='', quick=False)

Returns n sequences with a given structure.

Parameters:

structure (str) – Secondary RNA structure.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for folding, default is None.
n (int) – Number of output sequences, default is 5.
RNAprimer (str) – RNA primer sequence.
stall (str) – Nucleotide sequence to avoid.
quick (bool) – If False, uses -Fmp -f 0.01 params, default is False.

Returns:

DataFrame with generated sequences.

Return type:

pd.DataFrame

RNAinvertStall(structure='', RNAprimer='', stall='A', n=200)

Returns sequences without the nucleotide that is present in stall.

Parameters:

structure (str) – Secondary RNA structure.
RNAprimer (str) – RNA primer sequence.
stall (str) – Nucleotide sequence to avoid, default is “A”.
n (int) – Number of sequences to generate, default is 200.

Returns:

DataFrame with filtered sequences.

Return type:

pd.DataFrame

UNAfold(data, saveData=False, temp=None)

Calculates dG using UNAfold.

Parameters:

data ({list, pd.Series, pd.DataFrame}) – Input data, can be a list, Series, or DataFrame with “name” column.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for folding, default is None.

Returns:

DataFrame with calculated dG values.

Return type:

pd.DataFrame

bashFolding(method='RNA')

Runs RNA folding using bash.

Parameters:: method (str) – Folding method, “RNA” for ViennaRNA or “UNA” for UNAfold.
Returns:: None

class trxtools.nascent.Hybrid(tempDir=None)

Bases: object

RNAhybrid(data, saveData=False, temp=None)

Calculates dG using hybrid-min. This function calculates the hybridization energy (dG) between RNA sequences using the hybrid-min tool.

Parameters:

data ({list, pd.Series, pd.DataFrame}) – Input data, can be a list, Series, or DataFrame with “name” column.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for hybridization, default is None.

Returns:

DataFrame with calculated dG values.

Return type:

pd.DataFrame

Example:

>>> hybrid = Hybrid()
>>> data = ["AGCUAGUCA", "CGAUCGUAG"]
>>> hybrid.RNAhybrid(data)

bashHybrid()

Runs hybrid-min using bash. This function executes the hybrid-min command using bash to calculate the hybridization energy between two sequences.

Returns:: None

trxtools.nascent.extendingWindow(sequence='', name='name', strand='plus', temp=30, m=7, toAdd=0)

Returns DataFrame of sequences of all possible lengths between minimum (m) and length of input sequence -1.

Parameters:

sequence (str) – str, input sequence
name (str) – name of the sequence, default “name”
strand (str) – strand type {“plus”, “minus”}, default “plus”
temp (int) – temperature, default 30
m (int) – minimum length of sequence, default 7
toAdd (int) – additional length to add, default 0

Returns:

DataFrame of sequences with names according to nascent.slidingWindow convention

Return type:

pd.DataFrame

trxtools.nascent.foldNascentElem(data=Empty DataFrame Columns: [] Index: [])

Fold the very 3’ of nascent elements.

Parameters:: data (pd.DataFrame) – DataFrame containing folded sequences.
Returns:: DataFrame with additional columns for nascent element folding energies.
Return type:: pd.DataFrame

trxtools.nascent.handleInput(data, keepNames=True)

Input data with columns: “seq” or “sequence” and “name” (optional).

Parameters:

data ({str, list, pd.Series, pd.DataFrame}) – input data {str, list, Series, DataFrame}
keepNames (bool) – if True use given names, default True

Returns:

DataFrame where index become name of sequence to fold

Return type:

pd.DataFrame

trxtools.nascent.join2d(df=Empty DataFrame Columns: [] Index: [], use='format')

Joins 2D data from RNA folding results.

Parameters:

df (pd.DataFrame) – DataFrame containing RNA folding results.
use (str) – Column to use for joining, default is ‘format’.

Returns:

Dictionary of DataFrames, one for each gene/chromosome.

Return type:

dict

trxtools.nascent.merge2d(df=<class 'pandas.core.frame.DataFrame'>)

Merges 2D data from RNA folding results.

Parameters:: df (pd.DataFrame) – DataFrame containing RNA folding results.
Returns:: DataFrame with statistical description of the merged data.
Return type:: pd.DataFrame

trxtools.nascent.name2index(s1=Series([], dtype: object))

Extracts position from sequence name.

Parameters:: s1 (pd.Series) – Series with names from prepareNascent function
Returns:: Series with positions
Return type:: pd.Series

trxtools.nascent.nascentElems(vienna='', sequence='')

Describe elements of secondary structure: stems, multistems.

Parameters:

vienna (str) – Vienna RNA secondary structure notation.
sequence (str) – RNA sequence.

Returns:

Tuple containing the sequence of the last nascent element and its distance from the end.

Return type:

tuple

trxtools.nascent.nascentElemsDataFrame(data=Empty DataFrame Columns: [] Index: [])

Apply nascentElems to DataFrame of folded sequences.

Parameters:: data (pd.DataFrame) – DataFrame containing folded sequences with ‘vienna’ and ‘sequence’ columns.
Returns:: DataFrame with additional columns for nascent element sequences and distances.
Return type:: pd.DataFrame

trxtools.nascent.nascentFolding(sequence='', temp=30, window=100)

Combines folding function: fold RNA, locate last nascent element and calculate dG of it.

Parameters:

sequence (str) – RNA sequence.
temp (int) – Folding temperature, default is 30.
window (int) – Window size for folding, default is 100.

Returns:

DataFrame with folding results and nascent element information.

Return type:

pd.DataFrame

trxtools.nascent.parseFoldingName(df=Empty DataFrame Columns: [] Index: [])

Parse folding names into separate columns.

Parameters:: df (pd.DataFrame) – DataFrame containing folding names.
Returns:: DataFrame with parsed columns.
Return type:: pd.DataFrame

trxtools.nascent.prepareNascent(sequence='', name='name', strand='plus', temp=30, window=100)

Divide long transcript into short sequences. Combines output of extendingWindow and slidingWindow.

Parameters:

sequence (str) – input sequence
name (str) – name of the sequence, default “name”
strand (str) – strand type {“plus”, “minus”}, default “plus”
temp (int) – temperature, default 30
window (int) – window size, default 100

Returns:

DataFrame with sequences

Return type:

pd.DataFrame

trxtools.nascent.slidingWindow(sequence='', name='name', strand='plus', temp=30, window=100)

Slices sequence using sliding window.

Parameters:

sequence (str) – input sequence
name (str) – name of the sequence, default “name”
strand (str) – strand type {“plus”, “both”, “minus”}, default “plus”
temp (int) – temperature, default 30
window (int) – window size, default 100

Returns:

DataFrame with sliding windows

Return type:

pd.DataFrame

trxtools.plotting module

trxtools.plotting.GOterm(df, x='fdr', y='term.label', normalizer='pValue', cutoff=0.05, count=20, figsize=(4, 4), dpi=300, title=None, fname=None)

Generate a bar plot for GO terms based on significance.

Parameters:

df (pd.DataFrame) – Input DataFrame containing GO term data.
x (str, optional) – The column name for the x-axis values (e.g., FDR). Defaults to ‘fdr’.
y (str, optional) – The column name for the y-axis values (e.g., term label). Defaults to ‘term.label’.
normalizer (str, optional) – The column name used for normalization (e.g., p-value). Defaults to ‘pValue’.
cutoff (float, optional) – The cutoff value for significance. Defaults to 0.05.
count (int, optional) – The number of top terms to display. Defaults to 20.
figsize (tuple, optional) – The figure size. Defaults to (4, 4).
dpi (int, optional) – The resolution of the saved figure. Defaults to 300.
title (str, optional) – The title of the plot. Defaults to None.
fname (str, optional) – The file path to save the figure. Defaults to None.

Returns:

None

Example:

>>> GOterm(df, x='fdr', y='term.label', normalizer='pValue', cutoff=0.05, count=20, title='GO Term Plot')

trxtools.plotting.PCA(data=Empty DataFrame Columns: [] Index: [], names=[], title='', PClimit=1, figsize=(7, 7), PCval=[])

Plot PCA plot.

Parameters:

data (pd.DataFrame) – DataFrame containing PCA results.
names (list, optional) – List of names to annotate.
title (str, optional) – Title of the plot.
PClimit (int, optional) – Number of principal components to plot. Defaults to 1.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 7).
PCval (list, optional) – List of explained variance for each principal component. Defaults to an empty list.

Returns:

None

Example:

>>> PCA(data, names=['Sample1', 'Sample2'], title='PCA Plot', PClimit=2)

trxtools.plotting.boxplot1(data, labels=None, title='', figsize=(7, 6), dpi=150, log=False, lim=None, name_cmap='Spectral', vert=1, color=None, grid=False, fname=None)

Generate a box plot.

Parameters:

data (list or pd.DataFrame or pd.Series) – Data to plot.
labels (list, optional) – Labels for the data. Defaults to None.
title (str, optional) – Title of the plot. Defaults to an empty string.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 6).
dpi (int, optional) – Resolution of the figure. Defaults to 150.
log (bool, optional) – Whether to use a logarithmic scale. Defaults to False.
lim (tuple, optional) – Limits for the y-axis. Defaults to None.
name_cmap (str, optional) – Name of the colormap to use. Defaults to ‘Spectral’.
vert (int, optional) – Whether to plot the boxes vertically. Defaults to 1.
color (str, optional) – Color of the boxes. Defaults to None.
grid (bool, optional) – Whether to display a grid. Defaults to False.
fname (str, optional) – File path to save the figure. Defaults to None.

Returns:

None

Example:

>>> boxplot1(data, labels=['A', 'B', 'C'], title='Box Plot', log=True)

trxtools.plotting.clusterClusterMap(df)

Generate a clustermap for clusters.

Parameters:: df (pd.DataFrame) – DataFrame containing cluster data.
Returns:: None
Example:

>>> clusterClusterMap(df)

trxtools.plotting.cumulativeDifference(ref, df2=Empty DataFrame Columns: [] Index: [], local_pos=[], dpi=150, fill_between=True, title='', start=None, stop=None, window=50, figsize=(4, 3), equal_weight=False, color1='green', lc='red', fname=None, use='mean', ylim=None)

Plot cumulative differences for peaks metaplot.

Parameters:

ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
df2 (pd.DataFrame) – DataFrame containing the second dataset.
local_pos (list) – List of positions of features (peaks/troughs).
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
fill_between (bool, optional) – Whether to fill the area between the plot and the x-axis. Defaults to True.
title (str, optional) – Title of the plot. Defaults to an empty string.
start (int, optional) – Start position for filtering local positions. Defaults to None.
stop (int, optional) – Stop position for filtering local positions. Defaults to None.
window (int, optional) – Window size around each feature. Defaults to 50.
figsize (tuple, optional) – Size of the figure. Defaults to (4, 3).
equal_weight (bool, optional) – Whether to normalize the data to equal weight. Defaults to False.
color1 (str, optional) – Color for the difference plot. Defaults to ‘green’.
lc (str, optional) – Color for the vertical line at position 0. Defaults to ‘red’.
fname (str, optional) – File path to save the figure. Defaults to None.
use (str, optional) – Method to aggregate data (‘mean’, ‘median’, ‘sum’). Defaults to ‘mean’.
ylim (tuple, optional) – Limits for the y-axis. Defaults to None.

Returns:

None

Example:

>>> cumulativeDifference('reference.csv', df2, local_pos=[100, 200, 300], title='Cumulative Differences')

trxtools.plotting.cumulativePeaks(ref, df2=Empty DataFrame Columns: [] Index: [], local_pos=[], dpi=150, title='', start=None, stop=None, window=50, figsize=(4, 3), equal_weight=False, color1='green', color2='magenta', lc='red', fname=None, use='mean', ylim=None)

Plot cumulative peaks metaplot for a single gene.

Parameters:

ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
df2 (pd.DataFrame) – DataFrame containing the second dataset.
local_pos (list) – List of positions of features (peaks/troughs).
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
title (str, optional) – Title of the plot. Defaults to an empty string.
start (int, optional) – Start position for filtering local positions. Defaults to None.
stop (int, optional) – Stop position for filtering local positions. Defaults to None.
window (int, optional) – Window size around each feature. Defaults to 50.
figsize (tuple, optional) – Size of the figure. Defaults to (4, 3).
equal_weight (bool, optional) – Whether to normalize the data to equal weight. Defaults to False.
color1 (str, optional) – Color for the reference dataset plot. Defaults to ‘green’.
color2 (str, optional) – Color for the second dataset plot. Defaults to ‘magenta’.
lc (str, optional) – Color for the vertical line at position 0. Defaults to ‘red’.
fname (str, optional) – File path to save the figure. Defaults to None.
use (str, optional) – Method to aggregate data (‘mean’, ‘median’, ‘sum’). Defaults to ‘mean’.
ylim (tuple, optional) – Limits for the y-axis. Defaults to None.

Returns:

None

Example:

>>> cumulativePeaks('reference.csv', df2, local_pos=[100, 200, 300], title='Cumulative Peaks')

trxtools.plotting.enhancedVolcano(df, x_fc='log2FoldChange', y_pval='padj', pval_cutoff=0.01, fc_cutoff=2, plot_n=True, labels=True, xlim=None, ylim=None, figsize=(4, 4), dpi=300, title=None, save=None)

Generate an enhanced volcano plot based on DataFrame values.

Parameters:

df (pd.DataFrame) – Input DataFrame containing the data.
x_fc (str, optional) – The column name for the x-axis values (fold change). Defaults to ‘log2FoldChange’.
y_pval (str, optional) – The column name for the y-axis values (adjusted p-value). Defaults to ‘padj’.
pval_cutoff (float, optional) – The p-value cutoff for significance. Defaults to 0.01.
fc_cutoff (float, optional) – The fold change cutoff for significance. Defaults to 2.
plot_n (bool, optional) – Whether to plot non-significant points. Defaults to True.
labels (bool, optional) – Whether to add labels to significant points. Defaults to True.
xlim (tuple, optional) – The x-axis limits. Defaults to None.
ylim (tuple, optional) – The y-axis limits. Defaults to None.
figsize (tuple, optional) – The figure size. Defaults to (4, 4).
dpi (int, optional) – The resolution of the saved figure. Defaults to 300.
title (str, optional) – The title of the plot. Defaults to None.
save (str, optional) – The file path to save the figure. Defaults to None.

Returns:

None

Example:

>>> enhancedVolcano(df, x_fc='log2FoldChange', y_pval='padj', pval_cutoff=0.01, fc_cutoff=2, title='Volcano Plot')

trxtools.plotting.generateSubplots(dataframes, figsize=(5, 3), dpi=300, save=None)

Generate subplots for multiple dataframes.

Parameters:

dataframes (list) – A list of dataframes to plot. Each element is a list of lists containing title, dataframe, and optional colormap.
figsize (tuple, optional) – The size of the figure. Defaults to (5, 3).
dpi (int, optional) – The resolution of the figure in dots per inch. Defaults to 300.
save (str, optional) – The file path to save the figure. If not provided, the figure will be displayed.

Returns:

None

Example:

>>> df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
>>> df2 = pd.DataFrame({'x': [7, 8, 9], 'y': [10, 11, 12]})
>>> generateSubplots([
>>>     [['Title 1', df1, None], 
>>>      ['Title 2', df2, None]]
>>> ], figsize=(10, 6), dpi=150, save='plot.png')

trxtools.plotting.hplotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot the average input read length as a horizontal bar plot.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> hplotSTARstats_readLen(df)

trxtools.plotting.hplotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot the number of reads after pre-processing as a horizontal bar plot.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> hplotSTARstats_reads(df)

trxtools.plotting.metaprofileAndHeatmap(data_metaplot, data_heatmap, subsets={'title': None}, agg_type='mean', normalize_internal=True, differential_heatmap=True, figsize=(5, 3), dpi=300, save=None, bins=[50, 10, 50])

Generate metaprofile and heatmap plots for multiple subsets.

Parameters:

data_metaplot (dict) – Data for metaprofile plot.
data_heatmap (dict) – Data for heatmap plot.
subsets (dict, optional) – Dictionary of subsets with titles as keys and subset data as values. Defaults to {“title”: None}.
agg_type (str, optional) – Aggregation type for metaprofile (‘mean’, ‘median’, ‘sum’). Defaults to ‘mean’.
normalize_internal (bool, optional) – Whether to normalize data internally. Defaults to True.
differential_heatmap (bool, optional) – Whether to plot differential heatmap. Defaults to True.
figsize (tuple, optional) – The size of the figure. Defaults to (5, 3).
dpi (int, optional) – The resolution of the figure in dots per inch. Defaults to 300.
save (str, optional) – The file path to save the figure. If not provided, the figure will be displayed.
bins (list, optional) – List of bin sizes for gene scheme. Defaults to [50, 10, 50].

Returns:

None

Example:

>>> metaprofileAndHeatmap(data_metaplot, data_heatmap, subsets={"Subset 1": subset1, "Subset 2": subset2}, 
>>>                       agg_type='mean', normalize_internal=True, differential_heatmap=True,
>>>                       figsize=(10, 6), dpi=150, save='plot.png', bins=[50, 10, 50])

trxtools.plotting.plotAndFolding(df=Empty DataFrame Columns: [] Index: [], dG=Series([], dtype: float64), title='', start=None, stop=None, legend=True, figsize=(7, 3), ylim=(None, 0.01), dpi=150, color='green', name='median', h_lines=[], lc='red', offset=0, fname=None)

Plot a figure similar to a box plot with additional delta G values.

Parameters:

df (pd.DataFrame) – DataFrame containing the data with columns: [‘position’, ‘mean’, ‘median’, ‘std’] and optionally [‘nucleotide’, ‘q1’, ‘q3’, ‘max’, ‘min’].
dG (pd.Series) – Series containing delta G values.
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
legend (bool, optional) – Whether to display the legend. Defaults to True.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
color (str, optional) – Color for the median line. Defaults to ‘green’.
name (str, optional) – Name for the median line in the legend. Defaults to ‘median’.
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 0.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.

Returns:

None

Example:

>>> plotAndFolding(df, dG, title="Plot with Delta G", start=10, stop=90, color='blue', name='median', h_lines=[20, 40, 60], lc='black', offset=5, fname='plot.png')

trxtools.plotting.plotSTARstats(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot all STAR statistics including reads, read length, mapping, mismatches, and chimeric reads.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> plotSTARstats(df)

trxtools.plotting.plotSTARstats_chimeric(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot chimeric reads and splices.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> plotSTARstats_chimeric(df)

trxtools.plotting.plotSTARstats_mapping(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot STAR mapping statistics.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> plotSTARstats_mapping(df)

trxtools.plotting.plotSTARstats_mistmatches(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot mapping errors including mismatches, deletions, and insertions.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> plotSTARstats_mistmatches(df)

trxtools.plotting.plotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot the average input read length.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> plotSTARstats_readLen(df)

trxtools.plotting.plotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)

Plot the number of reads after pre-processing.

Parameters:

df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.

Returns:

None

Example:

>>> plotSTARstats_reads(df)

trxtools.plotting.plot_as_box_plot(df=Empty DataFrame Columns: [] Index: [], title='', start=None, stop=None, name='median', figsize=(7, 3), ylim=(None, 0.01), dpi=150, color='green', h_lines=[], lc='red', offset=0, fname=None)

Plot a figure similar to a box plot: median, 2nd and 3rd quartiles, and min-max range.

Parameters:

df (pd.DataFrame) – DataFrame containing the data with columns: [‘position’, ‘mean’, ‘median’, ‘std’] and optionally [‘nucleotide’, ‘q1’, ‘q3’, ‘max’, ‘min’].
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
name (str, optional) – Name for the median line in the legend. Defaults to ‘median’.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
color (str, optional) – Color for the median line. Defaults to ‘green’.
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 0.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.

Returns:

None

Example:

>>> df = pd.DataFrame({'position': range(100), 'median': np.random.rand(100), 'q1': np.random.rand(100), 'q3': np.random.rand(100), 'min': np.random.rand(100), 'max': np.random.rand(100)})
>>> plot_as_box_plot(df, title="Box Plot", start=10, stop=90, name='median', figsize=(10, 5), ylim=(0, 1), dpi=200, color='blue', h_lines=[20, 40, 60], lc='black', offset=5, fname='box_plot.png')

trxtools.plotting.plot_diff(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm', label1='reference', label2='', title='', start=None, stop=None, plot_medians=True, plot_ranges=True, legend=True, dpi=150, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], lc='red', offset=0, fname=None)

Plot given dataset and reference, highlighting differences.

Parameters:

ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
dataset (pd.DataFrame) – DataFrame containing the data to compare with columns: [‘position’, ‘mean’, ‘median’, ‘std’] and optionally [‘nucleotide’, ‘q1’, ‘q3’, ‘max’, ‘min’].
ranges (str, optional) – Type of range to highlight (‘mm’ for min-max or ‘qq’ for q1-q3). Defaults to ‘mm’.
label1 (str, optional) – Label for the reference dataset in the legend. Defaults to “reference”.
label2 (str, optional) – Label for the comparison dataset in the legend.
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
plot_medians (bool, optional) – Whether to plot medians. Defaults to True.
plot_ranges (bool, optional) – Whether to plot ranges. Defaults to True.
legend (bool, optional) – Whether to display the legend. Defaults to True.
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 0.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.

Returns:

None

Example:

>>> plot_diff('reference.csv', dataset, ranges='qq', label1='Reference', label2='Comparison', title='Difference Plot', start=10, stop=90, plot_medians=True, plot_ranges=True, h_lines=[20, 40, 60], lc='black', offset=5, fname='difference_plot.png')

trxtools.plotting.plot_heatmap(df=Empty DataFrame Columns: [] Index: [], title='Heatmap of differences between dataset and reference plot for RDN37-1', vmin=None, vmax=None, figsize=(20, 10), dpi=300, fname=None)

Plot heatmap of differences between dataset and reference.

Parameters:

df (pd.DataFrame) – DataFrame containing the differences.
title (str, optional) – Title of the plot. Defaults to ‘Heatmap of differences between dataset and reference plot for RDN37-1’.
vmin (float, optional) – Minimum value for the colormap. Defaults to None.
vmax (float, optional) – Maximum value for the colormap. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (20, 10).
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 300.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.

Returns:

None

Example:

>>> plot_heatmap(df, title='Difference Heatmap', vmin=-1, vmax=1, figsize=(15, 8), dpi=200, fname='heatmap.png')

trxtools.plotting.plot_to_compare(ref, df=Empty DataFrame Columns: [] Index: [], color1='green', color2='black', legend=True, ref_label='', label='', title='', start=None, stop=None, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], lc='red', dpi=150, offset=300, fname=None)

Plot a figure to compare two datasets similar to a box plot.

Parameters:

ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
df (pd.DataFrame) – DataFrame containing the data to compare.
color1 (str, optional) – Color for the reference dataset plot. Defaults to ‘green’.
color2 (str, optional) – Color for the comparison dataset plot. Defaults to ‘black’.
ref_label (str, optional) – Label for the reference dataset in the legend.
label (str, optional) – Label for the comparison dataset in the legend.
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
legend (bool, optional) – Whether to display the legend. Defaults to True.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 300.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.

Returns:

None

Example:

>>> plot_to_compare('reference.csv', df, color1='blue', color2='red', ref_label='Reference', label='Comparison', title='Comparison Plot', start=10, stop=90, h_lines=[20, 40, 60], lc='black', offset=5, fname='comparison_plot.png')

trxtools.plotting.select_colors(n=0, name_cmap='Spectral')

Select a list of colors from a colormap.

Parameters:

n (int, optional) – Number of colors to select, defaults to int()
name_cmap (str, optional) – Name of the colormap to use, defaults to ‘Spectral’

Returns:

List of colors

Return type:

list

trxtools.plotting.vennDiagram(*dataframes, labels=['df1', 'df2', 'df3'], colors=('skyblue', 'lightgreen', 'lightpink'), title=None, save=None)

Generate a Venn diagram for 2 or 3 DataFrames.

Parameters:

dataframes (list of pd.DataFrame) – Input DataFrames containing the data.
labels (list of str, optional) – Labels for the DataFrames. Defaults to [‘df1’,’df2’,’df3’].
colors (tuple of str, optional) – Colors for the Venn diagram. Defaults to (‘skyblue’, ‘lightgreen’, ‘lightpink’).
title (str, optional) – The title of the plot. Defaults to None.
save (str, optional) – The file path to save the figure. Defaults to None.

Returns:

None

Example:

>>> vennDiagram(df1, df2, labels=['Group 1', 'Group 2'], title='Venn Diagram')

trxtools.profiles module

trxtools.profiles.FoldingFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, ranges=0, offset=15, fold='dG65nt@30C')

Pulls folding information from BigWig folding data for a given gene.

Parameters:

gene_name (str) – str, name of the gene
gtf (pyCRAC.GTF2) – pyCRAC.GTF2 object with GTF and TAB files loaded
bwFWD (dict) – dict of pyBigWig objects for forward strand
bwREV (dict) – dict of pyBigWig objects for reverse strand
ranges (int) – int, flanks to be added for the gene, default 0
offset (int) – int, offset for folding data, default 15
fold (str) – str, name of output column, default “dG65nt@30C”

Returns:

DataFrame with folding information

Return type:

pd.DataFrame

Example:

>>> FoldingFromBigWig('gene1', gtf_object, bwFWD_dict, bwREV_dict)

trxtools.profiles.binCollect3(s1=Series([], dtype: float64), lengths=[500, 72, 500], bins=[50, 10, 50])

Collects and sums data from a series into bins of specified lengths.

Parameters:

s1 (pd.Series) – Series containing the input data
lengths (list) – List of lengths for each bin
bins (list) – List of number of bins for each length

Returns:

Series with the summed data from the bins

Return type:

pd.Series

Example:

>>> binCollect3(s1, lengths=[500,72,500], bins=[50, 10, 50])

trxtools.profiles.calculateFDR(data=Series([], dtype: float64), iterations=100, target_FDR=0.05)

Calculates False Discovery Rate (FDR) for a given dataset.

Parameters:

data (pd.Series) – Series containing the input data
iterations (int) – Number of iterations for random dataset generation, default 100
target_FDR (float) – Target False Discovery Rate, default 0.05

Returns:

Series with FDR applied

Return type:

pd.Series

Example:

>>> calculateFDR(data, iterations=100, target_FDR=0.05)

trxtools.profiles.compare1toRef(ref, dataset=Series([], dtype: float64), ranges='mm', heatmap=False, relative=False)

Takes Series and compare this with reference DataFrame.

Parameters:

ref (str or pd.DataFrame) – Path to csv file or DataFrame
dataset (pd.Series) – Series containing the dataset to compare
ranges (str) – ‘mm’ for min-max or ‘qq’ for q1-q3, default ‘mm’
heatmap (bool) – If True, return Series of differences to plot heatmap, default False
relative (bool) – If True, recalculate differences according to the peak size, default False

Returns:

DataFrame (heatmap=False) or Series (heatmap=True)

Return type:

pd.DataFrame or pd.Series

Example:

>>> compare1toRef(ref, dataset, ranges='mm', heatmap=False, relative=False)

trxtools.profiles.compareMoretoRef(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm')

Compares a DataFrame created by filter_df with a reference DataFrame.

Parameters:

ref (str or pd.DataFrame) – Path to csv file or DataFrame containing reference data
dataset (pd.DataFrame) – DataFrame containing the dataset to compare
ranges (str) – ‘mm’ for min-max or ‘qq’ for q1-q3, default ‘mm’

Returns:

DataFrame with comparison results

Return type:

pd.DataFrame

Example:

>>> compareMoretoRef(ref, dataset, ranges='mm')

trxtools.profiles.dictBigWig(files=[], path='', strands=True)

Preload BigWig files to memory using pyBigWig tools.

Parameters:

files (list) – list of str, filenames
path (str) – str, path to the files
strands (bool) – bool, if True handle strand-specific data, default True

Returns:

dict or tuple of dicts, BigWig objects

Return type:

dict or tuple

Example:

>>> dictBigWig(['sample1_fwd.bw', 'sample1_rev.bw'], '/path/to/files')
({'sample1': <pyBigWig object>}, {'sample1': <pyBigWig object>})

trxtools.profiles.findPeaks(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)

Find local extrema using SciPy argrelextrema function.

Parameters:

s1 (pd.Series) – Series data to localize peaks
window (int) – To smooth data before peak-calling, default 1 (no smoothing)
win_type (str) – Type of smoothing window, default “blackman”
order (int) – Minimal spacing between peaks, argrelextrema order parameter, default 20

Returns:

List of peaks

Return type:

list

Example:

>>> findPeaks(s1, window=1, win_type='blackman', order=20)

trxtools.profiles.findTroughs(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)

Find local minima using SciPy argrelextrema function.

Parameters:

s1 (pd.Series) – Series data to localize troughs
window (int) – To smooth data before trough-calling, default 1 (no smoothing)
win_type (str) – Type of smoothing window, default “blackman”
order (int) – Minimal spacing between minima, argrelextrema order parameter, default 20

Returns:

List of troughs

Return type:

list

Example:

>>> findTroughs(s1, window=1, win_type='blackman', order=20)

trxtools.profiles.geneFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, toStrip='', ranges=0)

Pull genome coverage from BigWig data for a given gene.

Parameters:

gene_name (str) – str, name of the gene
gtf (pyCRAC.GTF2) – pyCRAC.GTF2 object, GTF and TAB files loaded
bwFWD (dict) – dict, pyBigWig objects for forward strand
bwREV (dict) – dict, pyBigWig objects for reverse strand
toStrip (str) – str, name to be stripped
ranges (int) – int, flanks to be added for the gene, default 0

Returns:

DataFrame, genome coverage data

Return type:

pd.DataFrame

Example:

>>> geneFromBigWig('gene1', gtf_object, bwFWD_dict, bwREV_dict)

trxtools.profiles.ntotal(df=<class 'pandas.core.frame.DataFrame'>, drop=True)

Normalize data in DataFrame to fraction of total column.

Parameters:

df (pd.DataFrame) – DataFrame to be normalized
drop (bool) – If True, drop ‘position’ and ‘nucleotide’ columns, default True

Returns:

Normalized DataFrame

Return type:

pd.DataFrame

Example:

>>> ntotal(df, drop=True)

trxtools.profiles.parseConcatFile(path, gtf, use='reads', RPM=False, ranges=1000)

Parse concat file to extract gene data.

Parameters:

path (str) – str, path of the concat file
gtf (pyCRAC.GTF2) – pyCRAC.GTF2 object, GTF and TAB files loaded
use (str) – str, column name to use [‘reads’, ‘substitutions’, ‘deletions’], default “reads”
RPM (bool) – bool, if True use reads per million, default False
ranges (int) – int, flanks to be added for the gene, default 1000

Returns:

dict, DataFrames using gene name as a key

Return type:

dict

Example:

>>> parseConcatFile('/path/to/concat/file', gtf_object)

Warning

This function is deprecated and will be removed in future versions.

trxtools.profiles.preprocess(input_df=Empty DataFrame Columns: [] Index: [], let_in=[''], let_out=['wont_find_this_string'], stats=False, smooth=True, window=10, win_type='blackman', pseudocounts_param=True, ntotal_param=True)

Combines methods.filterExp and expStats. Returns DataFrame with chosen experiments, optionally apply smoothing and stats.

Parameters:

input_df (pd.DataFrame) – DataFrame containing the input data
let_in (list) – List of words that characterize experiment, default [‘’]
let_out (list) – List of words that disqualify experiments, default [‘wont_find_this_string’]
stats (bool) – If True, return stats for all experiments, default False
smooth (bool) – If True, apply smoothing window, default True
window (int) – Smoothing window size, default 10
win_type (str) – Type of smoothing window, default “blackman”
pseudocounts_param (bool) – If True, add 0.01 pseudocounts, default True
ntotal_param (bool) – If True, apply ntotal normalization, default True

Returns:

DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments

Return type:

pd.DataFrame

Example:

>>> preprocess(input_df, let_in=['exp1'], let_out=['exp2'], stats=True, smooth=True, window=10, win_type='blackman', pseudocounts_param=True, ntotal_param=True)

trxtools.profiles.pseudocounts(df=<class 'pandas.core.frame.DataFrame'>, value=0.01, drop=True)

Add pseudocounts to data to avoid zero values.

Parameters:

df (pd.DataFrame) – DataFrame to which pseudocounts will be added
value (float) – Value of the pseudocount to be added, default 0.01
drop (bool) – If True, drop ‘position’ and ‘nucleotide’ columns, default True

Returns:

DataFrame with pseudocounts added

Return type:

pd.DataFrame

Example:

>>> pseudocounts(df, value=0.01, drop=True)

trxtools.profiles.save_csv(data_ref=Empty DataFrame Columns: [] Index: [], datasets=Empty DataFrame Columns: [] Index: [], path=None)

Saves DataFrame to a CSV file.

Parameters:

data_ref (pd.DataFrame) – DataFrame containing reference data with ‘position’ and ‘nucleotide’ columns
datasets (pd.DataFrame) – DataFrame containing experimental data
path (str, optional) – Path to save the CSV file, default None

Returns:

DataFrame with combined reference and experimental data

Return type:

pd.DataFrame

Example:

>>> save_csv(data_ref, datasets, path='/path/to/save.csv')

trxtools.profiles.stripBigWignames(files=[])

Strip “_rev.bw” and “_fwd.bw” from file names.

Parameters:: files (list) – list of str, filenames
Returns:: list of str, unique names
Return type:: list
Example:

>>> stripBigWignames(['sample1_fwd.bw', 'sample1_rev.bw'])
['sample1']

trxtools.secondary module

trxtools.secondary.Lstem(vienna='')

Returns a list of positions where “(” is found using coordinates {1,inf}.

Parameters:: vienna (str) – Vienna format secondary structure
Returns:: List of positions with “(”
Return type:: list

Example: >>> Lstem(“((..))”) [1, 2]

trxtools.secondary.Rstem(vienna='')

Returns a list of positions where “)” is found using coordinates {1,inf}.

Parameters:: vienna (str) – Vienna format secondary structure
Returns:: List of positions with “)”
Return type:: list

Example: >>> Rstem(“((..))”) [5, 6]

trxtools.secondary.checkVienna(sequence='', vienna='')

Validates the integrity of a Vienna file by checking the sequence and structure lengths and the balance of parentheses.

Parameters:

sequence (str) – RNA sequence
vienna (str) – Vienna format secondary structure

Returns:

True if the Vienna file is valid, False otherwise

Return type:

bool

Example:

>>> checkVienna("GCAU", "((..))")
False

trxtools.secondary.loopStems(vienna='', sequence='', loopsList=None, testPrint=False)

Returns positions of stems of single hairpins and multiloop stems using coordinates {1:inf}.

Parameters:

vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence (optional)
loopsList (list) – List of loop positions (optional)
testPrint (bool) – Boolean to print test output, default=False

Returns:

List of stem positions and list of multistem positions

Return type:

tuple

Example:

>>> loopStems("((..))", "GCAU")
([(1, 6)], [])

trxtools.secondary.loops(vienna='')

Returns the first positions outside the loop, e.g., “.((….)).” returns [(3,8)].

Parameters:: vienna (str) – Vienna format secondary structure
Returns:: List of tuples with loop positions
Return type:: list

Example: >>> loops(“.((….)).”) [(3, 8)]

trxtools.secondary.substructures(vienna='', sequence='')

Lists sub-structures of the given structure.

Parameters:

vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence

Returns:

Series of sub-structures

Return type:

pd.Series

Example:

>>> substructures("((..))", "GCAU")
stem_1    GCAU
dtype: object

trxtools.secondary.test(vienna='', sequence='', loops=None, stems=None, multistems=None, linkers=None)

Prints the Vienna structure with given features marked.

Parameters:

vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence (optional)
loops (list) – List of tuples with loop positions (optional)
stems (list) – List of tuples with stem positions (optional)
multistems (list) – List of multistem positions (optional)
linkers (list) – List of linker positions (optional)

Returns:

None

Example: >>> test(“((..))”, “GCAU”, loops=[(3, 4)]) 1 ((..)) GCAU __O_O

trxtools.secondary.vienna2format(vienna='', sequence='', loopsList=None, stemsList=None, multistemsList=None, testPrint=False)

Converts Vienna format to a custom format with letters: O - loop, S - stem, M - multiloop stem, and L - linker.

Parameters:

vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence (optional)
loopsList (list) – List of loop positions (optional)
stemsList (list) – List of stem positions (optional)
multistemsList (list) – List of multistem positions (optional)
testPrint (bool) – Boolean to print test output, default=False

Returns:

Custom format string

Return type:

str

Example:

>>> vienna2format("((..))", "GCAU")
'SSLLSS'