trxtools package
trxtools.sam module
trxtools.sam.SAMgeneral submodule
- trxtools.sam.SAMgeneral.countDeletion(i=(), expand=0)
Takes tuple (position,CIGARstring) and returns list of mapped deletions. Each deletion is counted once, but expand parameter can merge some longer deletions (common expansion).
- Parameters:
i (tuple) – tuple (first position of the read ,CIGAR string)
expand (int, optional) – number of nucleotides to expand for each side, defaults to 0
- Returns:
list of mapped positions
- Return type:
np.array
>>> countDeletion((400,"3S15M1D9M2S")) array([415]) >>> countDeletion((400,"3S15M1D9M2S"),expand=3) array([412, 413, 414, 415, 416, 417, 418])
- trxtools.sam.SAMgeneral.countMiddle(i=(), expand=0)
Takes tuple (position,CIGARstring) and returns list with the middle of mapped read
- Parameters:
i (tuple) – tuple (first position of the read ,CIGAR string)
expand (int, optional) – number of nucleotides to expand for each side, defaults to 0
- Returns:
list of mapped positions
- Return type:
np.array
>>> countMiddle((400,"3S15M1D9M2S")) array([412]) >>> countMiddle((400,"3S15M1D9M2S"),expand=3) array([409, 410, 411, 412, 413, 414])
- trxtools.sam.SAMgeneral.countRead(i=())
Takes tuple (position,CIGARstring) and returns list of mapped positions
- Parameters:
i (tuple) – tuple (first position of the read ,CIGAR string)
- Returns:
list of mapped positions
- Return type:
np.array
>>> countRead((400,"3S15M1D9M2S")) array([400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424])
- trxtools.sam.SAMgeneral.groupCIGAR(cigar_string='')
Split CIGAR string to list of tuples
- Parameters:
cigar_string (str)
- Returns:
list of tuples
[( ),()]
- Return type:
list
>>> groupCIGAR("3S44M1S1H") [('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]
- trxtools.sam.SAMgeneral.noncoded2profile(df_input=Empty DataFrame Columns: [] Index: [], df_details=Empty DataFrame Columns: [] Index: [])
Turns non-coded ends into profile
- Parameters:
df_input (DataFrame) – output of parseNoncoded function
df_details (DataFrame) – chromosome lengths
- Returns:
DataFrame with profiles
- Return type:
DataFrame
- trxtools.sam.SAMgeneral.noncoded2profile1(df=Empty DataFrame Columns: [] Index: [], length=0)
Turns non-coded ends into profile
- Parameters:
df_input (DataFrame) – output of parseNoncodedList function
df_details (DataFrame) – chromosome lengths
- Returns:
Series with profiles
- Return type:
Series
- trxtools.sam.SAMgeneral.parseNoncoded(d={}, minLen=3)
Parse dict with non-coded ends and returns structured DataFrame
- Parameters:
d (dict) – dictionary with list of tuples for each chromosme
{"chrI" : [], "chrII" : []}
, defaults to dict()minLen (int, optional) – minimal length for non-coded end to keep, defaults to 3
- Returns:
DataFrame with parsed non-coded ends
- Return type:
DataFrame
>>> parseNoncoded({"chrI":[(40, 'AAA'), (35, 'AACAA')]}) index AAA AACAA chr 0 40 1.0 NaN chrI 1 35 NaN 1.0 chrI
- trxtools.sam.SAMgeneral.parseNoncodedList(l=[], minLen=3)
Parse list with non-coded ends and returns structured DataFrame
- Parameters:
l (list) – list of tuples
[(int,str)]
, defaults to lits()minLen (int, optional) – minimal length for non-coded end to keep, defaults to 3
- Returns:
DataFrame with parsed non-coded ends
- Return type:
DataFrame
>>> parseNoncodedList([(40, 'AAA'), (35, 'AACAA')]) AAA AACAA 35 NaN 1.0 40 1.0 NaN
- trxtools.sam.SAMgeneral.saveBigWig(paths={}, suffix='', bw_name='', chroms=[])
Save gzip pickle data to BigWig This function saves data from gzip pickle files to a BigWig file.
- Parameters:
paths (dict) – Dictionary with paths to gzip pickle files, where keys are file paths and values are file contents.
suffix (str, optional) – Suffix to be removed from chromosome names in the paths, defaults to an empty string.
bw_name (str) – Name of the output BigWig file.
chroms (list of tuples) – List of tuples with chromosome names and their lengths.
- Returns:
Success message indicating the suffix and that the file was saved successfully.
- Return type:
str
- Example:
>>> paths = {'/path/to/file_chr1.pkl.gz': 'data1', '/path/to/file_chr2.pkl.gz': 'data2'} >>> suffix = '3end' >>> bw_name = 'output.bw' >>> chroms = [('chr1', 1000000), ('chr2', 2000000)] >>> saveBigWig(paths, suffix, bw_name, chroms) '3end saved successfully'
- trxtools.sam.SAMgeneral.selectEnds(df=Empty DataFrame Columns: [] Index: [], ends='polyA')
Wrapper for functions selecting non-coded ends
- Parameters:
df (DataFrame) – output of parseNoncoded or parseNoncodedList
ends (str) – type of ends, currently only “polyA” is availible, defaults to “polyA”
- Returns:
runs selectPolyA
- Return type:
DataFrame
- trxtools.sam.SAMgeneral.selectPolyA(df=Empty DataFrame Columns: [] Index: [])
Select only polyA non-coded ends containinig
"AAA"
and “A”-content above 75%- Parameters:
df (DataFrame) – output of parseNoncoded or parseNoncodedList
- Returns:
modified DataFrame
- Return type:
DataFrame
- trxtools.sam.SAMgeneral.selectSortPaths(paths={}, chroms=[], suffix='')
Select and sort paths based on chromosome names and suffix This function filters and sorts the paths dictionary based on the provided chromosome names and suffix.
- Parameters:
paths (dict) – Dictionary with paths to gzip pickle files, where keys are file paths and values are file contents.
chroms (list of tuples) – List of tuples with chromosome names and their lengths.
suffix (str, optional) – Suffix to be matched in the file paths, defaults to an empty string.
- Returns:
Dictionary with filtered and sorted paths.
- Return type:
dict
- Example:
>>> paths = {'/path/to/file_chr1_3end.pkl.gz': 'data1', '/path/to/file_chr2_3end.pkl.gz': 'data2'} >>> chroms = [('chr1', 1000000), ('chr2', 2000000)] >>> suffix = '3end' >>> selectSortPaths(paths, chroms, suffix) {'/path/to/file_chr1_3end.pkl.gz': 'data1', '/path/to/file_chr2_3end.pkl.gz': 'data2'}
- trxtools.sam.SAMgeneral.stripCIGAR(match=[], to_strip='H')
Removes H from output of groupCIGAR
- Parameters:
match (list) – output of groupCIGAR, defaults to []
to_strip (str, optional) – CIGAR mark to be stripped, defaults to “H”
- Returns:
modified list of tuples
- Return type:
list
- trxtools.sam.SAMgeneral.stripSubstitutions(match)
Strip substutiotns on both ends
- Parameters:
match (list) – output of groupCIGAR, defaults to []
- Returns:
list of tuples
- Return type:
list
- trxtools.sam.SAMgeneral.tostripCIGARfive(match=[])
Calculates length of soft-clipped nucleotides at the 5’ end of the read
- Parameters:
match (list) – output of groupCIGAR, defaults to []
- Returns:
number of substituted nucleotides
- Return type:
int
>>> tostripCIGARfive([('3', 'S'), ('44', 'M'), ('1', 'S'), ('1', 'H')]) 3
- trxtools.sam.SAMgeneral.tostripCIGARthree(match=[])
Nucleotides without alignment at the 3’ end of the read
- Parameters:
match (list) – output of groupCIGAR, defaults to []
- Returns:
number of substituted nucleotides
- Return type:
int
>>> tostripCIGARthree([('3', 'S'), ('44', 'M'), ('1', 'S')]) 1
trxtools.sam.SAMgenome submodule
- trxtools.sam.SAMgenome.chromosome2profile3end(l=[], length=0, strand='FWD')
Generates profile for the 3’ ends of reads and saves position of non-coded end
- Parameters:
l (list) – list of triple tuples (position, cigar_string, sequence), defaults to []
length (int) – length of chromosome
strand (str, optional) –
'FWD'
or'REV'
, defaults to ‘FWD’
- Returns:
profile, noncoded
- Return type:
np.array, list of tuples
>>> chromosome2profile3end(l=[(10,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="FWD") (0 0.0 1 0.0 ... 34 0.0 35 1.0 36 0.0 ... 50 0.0 dtype: float64, [(35, 'CC')]) >>> chromosome2profile3end(l=[(40,"3S15M1D9M2S","TTTGCGCAGTCGTGCGGGGCGCAGCGCCC")],length=50,strand="REV") (0 0.0 1 0.0 ... 39 0.0 40 1.0 41 0.0 ... 50 0.0 dtype: float64, [(40, 'AAA')])
- trxtools.sam.SAMgenome.parseHeader(filename, name, dirPath)
Parses the header of a SAM file to extract chromosome names and lengths.
- Parameters:
filename (str) – Path to the SAM file
name (str) – Name of the experiment
dirPath (str) – Directory path to save intermediate files
- Returns:
DataFrame with chromosome names and lengths
- Return type:
pd.DataFrame
- Example:
>>> parseHeader("example.sam", "experiment1", "/path/to/dir") chrName length chr1 248956422 chr2 242193529 ...
- trxtools.sam.SAMgenome.reads2genome(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='read', logName='')
Function used by sam2genome. Works for both strands.
- Parameters:
name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes
- Returns:
output_df_fwd, output_df_rev, log
- Return type:
DataFrame, DataFrame, list
- trxtools.sam.SAMgenome.reads2genome3end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='3end', noncoded=True, ends='polyA', logName='', minLen=3)
Function used by sam2genome3end. Works for both strands.
- Parameters:
name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes
noncoded (bool, optional) – If True then will parse and save non-coded ends, defaults to True
- Returns:
output_df_fwd, output_df_rev, log, noncoded_fwd, noncoded_rev
- Return type:
DataFrame, DataFrame, list, DataFrame, DataFrame
- trxtools.sam.SAMgenome.reads2genome5end(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='5end', logName='')
Function used by sam2genome5end. Works for both strands.
- Parameters:
name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes
- Returns:
output_df_fwd, output_df_rev, log
- Return type:
DataFrame, DataFrame, list
- trxtools.sam.SAMgenome.reads2genomeDeletions(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], use='del', expand=0, logName='')
Function used by sam2genome. Works for both strands.
- Parameters:
name (str) – name of experiment
dirPath (str)
df_details (DataFrame) – lengths of chromosomes
- Returns:
output_df_fwd, output_df_rev, log
- Return type:
DataFrame, DataFrame, list
- trxtools.sam.SAMgenome.sam2genome(filename='', path='', toClear='', chunks=0, use='3end', expand=0, noncoded=True, noncoded_suffix='polyA')
Function handling SAM files and generating profiles. Executed using wrapping script SAM2profilesGenomic.py.
- Parameters:
filename (str)
path (str)
toClear (str, optional) – element of filename to be removed, defaults to ‘’
chunks (int, optional) – Read SAM file in chunks, defaults to 0
noncoded_pA (bool, optional) – Save non-coded polyA ends, defaults to True
noncoded_raw (bool, optional) – Save all non-coded ends, defaults to False
trxtools.sam.SAMtranscripts submodule
- trxtools.sam.SAMtranscripts.reads2profile(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [])
Takes a list of aligned reads and transforms them into a profile for each transcript. Tested only for PLUS strand.
- Parameters:
name (str) – The name of the file (without extension) containing the aligned reads.
dirPath (str) – The directory path where the file is located.
df_details (pandas.DataFrame) – A DataFrame containing details about the transcripts, including their lengths.
- Returns:
A tuple containing the output DataFrame with profiles and a log list with status messages.
- Return type:
tuple (pandas.DataFrame, list)
- Example:
>>> output_df, log = reads2profile(name='aligned_reads', dirPath='/path/to/files', df_details=df_details)
- trxtools.sam.SAMtranscripts.reads2profileDeletions(name='', dirPath='', df_details=Empty DataFrame Columns: [] Index: [], expand=5)
Takes a list of aligned reads and transforms them into profiles for each transcript, including deletions. Tested only for PLUS strand.
- Parameters:
name (str) – The name of the file (without extension) containing the aligned reads.
dirPath (str) – The directory path where the file is located.
df_details (pandas.DataFrame) – A DataFrame containing details about the transcripts, including their lengths.
expand (int) – The number of bases to expand around deletions.
- Returns:
A tuple containing the output DataFrames with profiles (reads, deletions, deletions with expansion) and a log list with status messages.
- Return type:
tuple (pandas.DataFrame, pandas.DataFrame, pandas.DataFrame, list)
- Example:
>>> output_df, outputDel_df, outputDelExt_df, log = reads2profileDeletions(name='aligned_reads', dirPath='/path/to/files', df_details=df_details, expand=5)
- trxtools.sam.SAMtranscripts.sam2profiles(filename='', path='', geneList=[], toClear='', df_details=Empty DataFrame Columns: [] Index: [], deletions=False, expand=5, pickle=False, chunks=0)
Function handling SAM files and generating profiles. Executed using wrapping script SAM2profiles.py.
- Parameters:
filename (str) – The name of the SAM file to be processed.
path (str) – The directory path where the SAM file is located.
geneList (list) – List of transcripts to be selected.
toClear (str, optional) – Element of filename to be removed, defaults to ‘’.
df_details (pandas.DataFrame) – DataFrame containing details of transcripts.
deletions (bool, optional) – Whether to generate profile of deletions, defaults to False.
expand (int, optional) – Number of bases to expand around deletions, defaults to 5.
pickle (bool, optional) – Whether to save output in pickle format, defaults to False.
chunks (int, optional) – Number of chunks to read SAM file in, defaults to 0.
- Returns:
None
- Example:
>>> sam2profiles(filename="example.sam", path="/path/to/files", geneList=["gene1", "gene2"], df_details=df_details)
- trxtools.sam.SAMtranscripts.transcript2profile(l=[], length=0)
Takes a list of tuples (position, CIGARstring) and generates a profile. Works with entire reads, for both strands.
- Parameters:
l (list) – List of tuples where each tuple contains a position and a CIGAR string.
length (int) – The length of the profile to be generated.
- Returns:
A pandas Series representing the profile, with positions as the index and counts as the values.
- Return type:
pandas.Series
- Example:
>>> l = [(1, '10M'), (2, '5M')] >>> length = 10 >>> profile = transcript2profile(l, length) >>> print(profile) 1 1.0 2 2.0 3 2.0 4 2.0 5 2.0 6 2.0 7 1.0 8 1.0 9 1.0 10 1.0 dtype: float64
- trxtools.sam.SAMtranscripts.transcript2profileDeletions(l=[], expand=0, length=0)
Takes a list of tuples (position, CIGARstring) and generates a profile for deletions.
- Parameters:
l (list) – List of tuples where each tuple contains a position and a CIGAR string.
expand (int) – The number of bases to expand around deletions.
length (int) – The length of the profile to be generated.
- Returns:
A pandas Series representing the profile, with positions as the index and counts as the values.
- Return type:
pandas.Series
- Example:
>>> profile = transcript2profileDeletions(l, expand, length)
trxtools.BigWig module
- trxtools.BigWig.foldingFromBigWig(gene_name, data_path, data_files, gtf, ranges=0, range5end=0, offset=15)
Extracts folding energy for a given gene from BigWig files, and applies a folding offset to the data.
- Parameters:
gene_name (str) – Name of the gene to extract data for.
data_path (str) – Path to the directory containing BigWig files.
data_files (list) – List of BigWig file names to extract data from.
gtf (GTF) – GTF file object containing gene annotations.
ranges (int, optional) – Range of nucleotides to extract, defaults to 0.
range5end (int, optional) – Number of bases to exclude from the 5’ end, defaults to 0.
offset (int, optional) – Number of bases to shift the data by, defaults to 15.
- Returns:
DataFrame containing nucleotide sequences and associated data with applied folding offset.
- Return type:
pd.DataFrame
- Example:
>>> gtf = GTF("/path/to/annotations.gtf") >>> df = foldingFromBigWig("BRCA1", "/path/to/bigwig/", ["file1.bw", "file2.bw"], gtf, ranges=1000, range5end=100, offset=15) >>> print(df.head())
- trxtools.BigWig.geneFromBigWig(gene_name, data_path, data_files, gtf, ranges=0, verbose=False)
Extracts nucleotide sequences and associated data for a given gene from BigWig files.
- Parameters:
gene_name (str) – Name of the gene to extract data for.
data_path (str) – Path to the directory containing BigWig files.
data_files (list) – List of BigWig file names to extract data from.
gtf (GTF) – GTF file object containing gene annotations.
ranges – Range of nucleotides to extract, defaults to 0.
verbose – If True, prints the name of each BigWig file being processed, defaults to False.
- Returns:
DataFrame containing nucleotide sequences and associated data.
- Return type:
pd.DataFrame
- Example:
>>> gtf = GTF("/path/to/annotations.gtf") >>> df = geneFromBigWig("BRCA1", "/path/to/bigwig/", ["file1.bw", "file2.bw"], gtf, ranges=1000, verbose=True) >>> print(df.head())
- trxtools.BigWig.getSeqData(gene_name, data_path, name, gtf, ranges=0)
Load BigWig data for a gene/chromosome region and return as pandas Series.
- Parameters:
gene_name (str) – Name of the gene/chromosome to retrieve data for.
data_path (str) – Path to the directory containing BigWig files.
name (str) – Base name of the BigWig files (without strand suffix and extension). Can be obtained using strip_BigWig_names().
gtf (object) – GTF object that provides gene information such as strand, chromosome, and coordinates.
ranges – Number of bases to extend the region on both sides, defaults to 0.
- Returns:
BigWig data for the specified gene or region as a pandas Series.
- Return type:
pandas.Series
- Example:
>>> gtf = GTF('/path/to/gtf_file.gtf') >>> data = getSeqData('BRCA1', '/path/to/bigwig/', 'sample', gtf, ranges=100) >>> print(data)
- trxtools.BigWig.strip_BigWig_names(files=[])
Remove _fwd.bw and _rev.bw from the list of BigWig files
- Parameters:
files (list, required) – list of filenames, defaults to list()
- Returns:
list of unique names
- Return type:
list
>>> strip_BigWig_names(["file1_fwd.bw", "file1_rev.bw", "file2_fwd.bw", "file2_rev.bw"]) ['file2', 'file1']
trxtools.metaprofiles module
- trxtools.metaprofiles.bed_split_strands(bed_df)
Splits a BED dataframe into two separate dataframes based on strand information.
- Parameters:
bed_df (pandas.DataFrame) – A dataframe containing BED format data. The strand information is expected to be in the 6th column (index 5).
- Returns:
A tuple containing two dataframes: - bed_plus (pandas.DataFrame): Dataframe containing entries with the ‘+’ strand. - bed_minus (pandas.DataFrame): Dataframe containing entries with the ‘-’ strand.
- Return type:
tuple
- trxtools.metaprofiles.binMultipleMatrices(mm={}, bins=[50, 10, 50], bed_df=Empty DataFrame Columns: [] Index: [], flank_5=None, flank_3=None)
Bin multiple matrices of tRNA profiles into a single dataframe
- Parameters:
mm (dict) – dictionary of matrices of gene profiles
bins (list, optional) – list of three integers, defaults to [50, 10, 50]
bed_df (pandas.DataFrame, optional) – dataframe in BED format containing genomic coordinates of target regions, defaults to pd.DataFrame()
flank_5 (int, optional) – length of 5’flank to extend BED regions by, defaults to None
flank_3 (int, optional) – length of 3’flank to extend BED regions by, defaults to None
- Raises:
ValueError – if bins is not a list of three integers
ValueError – if input_value is not an integer or a DataFrame
- Returns:
dictionary of binned matrices of tRNA profiles
- Return type:
dict
- trxtools.metaprofiles.getMultipleMatrices(bw_paths_plus, bw_paths_minus, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, normalize_libsize=True, align_3end=False)
Get score matrices for positions in given regions (with optional flanks) from multiple BigWig files. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks.
- Parameters:
bw_paths_plus (list) – list of paths to BigWig files (+ strand)
bw_paths_minus (list) – list of paths to BigWig files (- strand)
bed_df (pandas.DataFrame) – dataframe in BED format containing genomic coordinates of target regions
flank_5 (int, optional) – number of nt that input regions will be extended by on 5’ side, defaults to 0
flank_3 (int, optional) – number of nt that input regions will be extended by on 3’ side, defaults to 0
fill_na (bool, optional) – _description_, defaults to True
pseudocounts (float, optional) – pseudocounts to add to datapoints, defaults to None
normalize_libsize (bool, optional) – normalization to library size (sum of scores in a bigwig file), defaults to True
align_3end (bool, optional) – if True, align profiles at the 3’end of features. Otherwise align at 5’end, defaults to False
- Returns:
A dictionary containing score matrices for individual BigWig files. Dictionary keys are BigWig file names.
- Return type:
dict
- trxtools.metaprofiles.getMultipleMatricesFromPeak(peak_paths=[], bed_df=<class 'pandas.core.frame.DataFrame'>, g='references/GRCh38.primary_assembly.genome.cleaned.len', flank_5=0, flank_3=0)
Get score matrices for positions in given regions (with optional flanks) from multiple peak files. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. :param peak_paths: list of paths to peak files (broadPeak or narrowPeak format) :type peak_paths: list :param bed_df: dataframe in BED format containing genomic coordinates of target regions :type bed_df: pandas.DataFrame :param g: Path to genome file (chromosome lengths), defaults to ‘references/GRCh38.primary_assembly.genome.cleaned.len’ :type g: str :param flank_5: length of 5’flank to extend BED regions by, defaults to 0 :type flank_5: int, optional :param flank_3: length of 3’flank to extend BED regions by, defaults to 0 :type flank_3: int, optional :return: dictionary of score matrices for each peak file :rtype: dict
- trxtools.metaprofiles.get_bw_data(bed_row, bw, flank_5=0, flank_3=0, align_3end=False)
Retrieve BigWig scores for positions in a given region, optionally including flanks of given length.
- Parameters:
bed_row (list or pandas.Series) – A row from a BED file, containing chromosome, start, end, and strand information.
bw (pyBigWig) – A pyBigWig object to retrieve values from.
flank_5 (int, optional) – Length of the 5’ flank to include, defaults to 0.
flank_3 (int, optional) – Length of the 3’ flank to include, defaults to 0.
align_3end (bool, optional) – Whether to align the series to the 3’ end, defaults to False.
- Returns:
A pandas Series containing the BigWig scores for the specified region.
- Return type:
pandas.Series
- trxtools.metaprofiles.get_multiple_matrices(bw_paths_plus, bw_paths_minus, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, normalize_libsize=True, align_3end=False)
WARNINIG: This function is replaced by getMultipleMatrices()
Get score matrices for positions in given regions (with optional flanks) from multiple BigWig files. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks.
- Parameters:
bw_paths_plus (list) – list of paths to BigWig files (+ strand)
bw_paths_minus (list) – list of paths to BigWig files (- strand)
bed_df (pandas.DataFrame) – dataframe in BED format containing genomic coordinates of target regions
flank_5 (int, optional) – number of nt that input regions will be extended by on 5’ side, defaults to 0
flank_3 (int, optional) – number of nt that input regions will be extended by on 3’ side, defaults to 0
fill_na (bool, optional) – _description_, defaults to True
pseudocounts (float, optional) – pseudocounts to add to datapoints, defaults to None
normalize_libsize (bool, optional) – normalization to library size (sum of scores in a bigwig file), defaults to True
align_3end (bool, optional) – if True, align profiles at the 3’end of features. Otherwise align at 5’end, defaults to False
- Returns:
A dictionary containing score matrices for individual BigWig files. Dictionary keys are BigWig file names.
- Return type:
dict
Warning
This function is deprecated and will be removed in future versions.
- trxtools.metaprofiles.join_strand_matrices(plus_dict, minus_dict)
Combine score matrices for regions on + and - strands.
- Parameters:
plus_dict (dict) – Dictionary containing score matrices for regions on the + strand.
minus_dict (dict) – Dictionary containing score matrices for regions on the - strand.
- Raises:
Exception – If keys in dictionaries do not match.
- Returns:
Dictionary containing combined score matrices for regions on both strands.
- Return type:
dict
- trxtools.metaprofiles.matrixFromBigWig(bw_path, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, align_3end=False)
Get matrix with BigWig scores for all regions in a bed df from a single BigWig file. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. Flanks are strand-aware.
- Parameters:
bw_path (str) – Path to target BigWig file
bed_df (pandas.DataFrame) – DataFrame with BED data (use read_bed())
flank_5 (int, optional) – length of 5’flank to extend BED regions by, defaults to 0
flank_3 (int, optional) – length of 3’flank to extend BED regions by, defaults to 0
fill_na (bool, optional) – If true replace NaN values with 0 (pybigwig returns positions with 0 coverage as NaN), defaults to True
pseudocounts (float, optional) – pseudocounts to add to retrieved scores, defaults to None
align_3end (bool, optional) – If true, position 0 in the resulting matrix will be set at the target region’s 3’end instead of 5’end, defaults to False
- Returns:
DataFrame with the result score matrix
- Return type:
pandas.DataFrame
- trxtools.metaprofiles.matrix_from_bigwig(bw_path, bed_df, flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, align_3end=False)
WARNINIG: This function is replaced by matrixFromBigWig()
Get matrix with BigWig scores for all regions in a bed df from a single BigWig file. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. Flanks are strand-aware.
- Parameters:
bw_path (str) – Path to target BigWig file
bed_df (pandas.DataFrame) – DataFrame with BED data (use read_bed())
flank_5 (int, optional) – length of 5’flank to extend BED regions by, defaults to 0
flank_3 (int, optional) – length of 3’flank to extend BED regions by, defaults to 0
fill_na (bool, optional) – If true replace NaN values with 0 (pybigwig returns positions with 0 coverage as NaN), defaults to True
pseudocounts (float, optional) – pseudocounts to add to retrieved scores, defaults to None
align_3end (bool, optional) – If true, position 0 in the resulting matrix will be set at the target region’s 3’end instead of 5’end, defaults to False
- Returns:
DataFrame with the result score matrix
- Return type:
pandas.DataFrame
Warning
This function is deprecated and will be removed in future versions.
- trxtools.metaprofiles.metaprofile(matrix_dict, agg_type='mean', normalize_internal=False, subset=None)
Calculate metaprofiles from score matrices by aggregating each position in all regions.
- Parameters:
matrix_dict (dict) – Dict containing score matrices returned by get_multiple_matrices()
agg_type (str, optional) – Type of aggregation to use. Available: ‘mean’, ‘median’, ‘sum’, defaults to ‘mean’
normalize (bool, optional) – if true, normalize each profile internally (i.e. gene-wise) (x/sum(x)) before aggregating, defaults to False
- Raises:
Exception – _description_
- Returns:
dataframe containing metaprofile values for each position in each of the input matrices (i.e. bigwig files)
- Return type:
pandas.DataFrame
- trxtools.metaprofiles.peak2matrice(bed_df=<class 'pandas.core.frame.DataFrame'>, peak_file_path='', g='references/GRCh38.primary_assembly.genome.cleaned.len', flank_5=0, flank_3=0, fillna=True)
Get score matrices for positions in given regions (with optional flanks) from a single nroadPeak or narrowPeak file. Matrix rows correspond to regions in BED. Columns correpond to nucleotide positions in regions + flanks. Flanks are strand-aware.
- Parameters:
bed_df (pandas.DataFrame) – DataFrame with BED data (can use read_bed())
peak_file_path (str) – Path to peak file (broadPeak or narrowPeak format)
g (str) – Path to genome file (chromosome lengths), defaults to ‘references/GRCh38.primary_assembly.genome.cleaned.len’
flank_5 (int, optional) – length of 5’flank to extend BED regions by, strand-aware, defaults to 0
flank_3 (int, optional) – length of 3’flank to extend BED regions by, strand-aware, defaults to 0
fill_na (bool, optional) – If true replace NaN values with 0 (pybigwig returns positions with 0 coverage as NaN), defaults to True
- Returns:
DataFrame with the result score matrix
- Return type:
pandas.DataFrame
- trxtools.metaprofiles.readBED(bed_path)
Simple BED file parser
- Parameters:
bed_path (str) – Path to BED file
- Returns:
Contents of BED file in DataFrame form
- Return type:
pandas.DataFrame
- trxtools.metaprofiles.read_bed(bed_path)
Simple BED file parser
- Parameters:
bed_path (str) – Path to BED file
- Returns:
Contents of BED file in DataFrame form
- Return type:
pandas.DataFrame
Warning
This function is deprecated and will be removed in future versions.
- trxtools.metaprofiles.regionScore(bw_paths_plus, bw_paths_minus, bed_df, agg_type='sum', flank_5=0, flank_3=0, fill_na=True, pseudocounts=None, normalize_libsize=True)
Calculate coverage or statistic for multiple regions in multiple BigWig files.
- Parameters:
bw_paths_plus (list) – list of paths to BigWig files (+ strand)
bw_paths_minus (list) – list of paths to BigWig files (- strand)
bed_df (pandas.DataFrame) – dataframe in BED format containing genomic coordinates of target regions
agg_type (str, optional) – operation to perform on region scores. Available options: ‘sum’, ‘mean’, ‘median’, defaults to ‘sum’
flank_5 (int, optional) – number of nt that input regions will be extended by on 5’ side, defaults to 0
flank_3 (int, optional) – number of nt that input regions will be extended by on 3’ side, defaults to 0
fill_na (bool, optional) – If true, NaNs will be replaced with zeroes (recommended, as pyBigWig reports positions with 0 score as NaN), defaults to True
pseudocounts (float, optional) – pseudocounts to add to datapoints, defaults to None
normalize_libsize (bool, optional) – normalization to library size (sum of scores in a bigwig file), defaults to True
- Returns:
DataFrame with calculated scores. Rows are regions/genes, columns are BigWig files.
- Return type:
pandas.DataFrame
- trxtools.metaprofiles.selectSubsetMM(matrix_dict, subset=None)
Select a subset of regions from the matrix dictionary.
- Parameters:
matrix_dict (dict) – Dictionary containing score matrices.
subset (Union[pd.DataFrame, list], optional) – DataFrame or list containing the subset of regions to select, defaults to None
- Returns:
Dictionary containing the selected subset of score matrices.
- Return type:
dict
trxtools.assays module
- trxtools.assays.bulkInputIDT(data=Empty DataFrame Columns: [] Index: [])
Transforms dataframe with columns “template” and “non-template” and prepares table to be used as bulk input. Discards oligos longer than 200 nt.
- Parameters:
data (DataFrame) – Table with columns “template” and “non-template”
- Returns:
Table with sequences to order
- Return type:
DataFrame
- trxtools.assays.extruded(seq='', buried=13)
Returns the extruded sequence by removing the buried nucleotides from the end
- Parameters:
seq (str) – sequence
buried (int) – length of sequence buried within RNAP
- Returns:
extruded sequence
- Return type:
str
- trxtools.assays.findPrimer(seq='', query='')
Find query sequence within given sequence
- Parameters:
seq (str) – sequence
query (str) – query within searched sequences
- Returns:
(start,stop) in python indexing
- Return type:
tuple
- Example:
>>> findPrimer("ATCGATCG", "ATCG") (0, 4)
- trxtools.assays.nonTemplateDNA(seq='', primer='')
Returns the DNA sequence of the non-template strand.
- Parameters:
seq (str) – sequence
primer (str) – primer sequence
- Returns:
DNA sequence of non-template strand
- Return type:
str
- trxtools.assays.sequenceConstrain(structure='', stall='', RNAprimer='')
Returns sequence constrained with the 5prime RNA primer
- Parameters:
structure (str) – sequence structure
stall (str) – single letter representing the stall sequence
RNAprimer (str) – RNA primer sequence
- Returns:
sequence constraints for folding algorithm
- Return type:
str
- trxtools.assays.stalled(seq='', stall='AAA', primer='')
Finds and returns stalled sequence
- Parameters:
seq (str) – sequence
stall (str) – stall sequence, default “AAA”
primer (str) – primer sequence
- Returns:
stalled sequence
- Return type:
str
- trxtools.assays.structureFile(structure='', stall='', RNAprimer='')
Saves structure file in current directory
- Parameters:
structure (str) – sequence structure
stall (str) – single letter representing the stall sequence
RNAprimer (str) – RNA primer sequence
- Returns:
True
- Return type:
bool
- trxtools.assays.templateDNA(seq='', overhang5end='')
Returns the DNA sequence of the template strand. Takes reverse complement of RNA sequence and the 5’end DNA overhang as input.
- Parameters:
seq (str) – sequence of RNA
overhang5end (str) – sequence of the 5’end DNA overhang
- Returns:
DNA sequence of template strand
- Return type:
str
- trxtools.assays.testScaffold(data=Empty DataFrame Columns: [] Index: [], overhang5end='', RNA_primer='')
Prints scaffold to test before ordering
- Parameters:
data (DataFrame) – DataFrame with columns “template” and “non-template”
overhang5end (str) – sequence of the 5’end DNA overhang
RNA_primer (str) – RNA primer sequence
- Returns:
None
- trxtools.assays.toOrder(data=Empty DataFrame Columns: [] Index: [], buried='GUCUGUUUUGUGG', stallFull='AAA', afterStall='TGATCGGTAC', overhang5end='TGA', RNA_primer='AGGCCGAAA', bulkInput=True, test=True, lengthMax=200)
Prepares sequences for ordering by applying folding functions to DNA sequences
- Parameters:
data (DataFrame) – DataFrame with columns “sequence” and “name”
buried (str) – sequence buried within RNAP, default “GUCUGUUUUGUGG”
stallFull (str) – stall sequence, default “AAA”
afterStall (str) – sequence after stall, default “TGATCGGTAC”
overhang5end (str) – sequence of the 5’end DNA overhang, default “TGA”
RNA_primer (str) – RNA primer sequence, default “AGGCCGAAA”
bulkInput (bool) – whether to prepare bulk input for ordering, default True
test (bool) – whether to print scaffold for testing before ordering, default True
lengthMax (int) – maximum length of sequences to order, default 200
- Returns:
DataFrame of sequences to order
- Return type:
DataFrame
trxtools.go_enrichment module
- trxtools.go_enrichment.get_enrichment(query_genes, organism, go_dataset='biological_process', use_reference_set=False, ref_genes=None, ref_organism=None, test_type='FISHER', correction='FDR')
Run a GO term enrichment test using PANTHER API
- Parameters:
query_genes (list) – List of sequence identifiers of queried genes (e.g. transcript ids, gene ids)
organism (str) – Taxid of query species (e.g. “9606” for H. sapiens)
go_dataset (str) – Which annotation dataset to query, “biological_process” or “molecular_function”
use_reference_set (bool) – Use a custom set of rerence (background) genes? Default False. If True, ref_genes and ref_species need to be specifed.
ref_genes (list) – optional list of reference genes. Specifying None (default) will use the whole genome of species specified in organism. When passing a list, ref_organism taxid must also be provided.
ref_species (str) – Taxid of reference species, required when ref_genes is not None
test_type (str) – Which tatistical test to use. Available: “FISHER” (default), “BINOMIAL”
correction (str) – Which multiple testing correction method to use. Available: “FDR” (default), “BONFERRONI”, “NONE”
- Returns:
Unfiltered DataFrame of results.
- Return type:
pandas.DataFrame:
trxtools.methods module
- trxtools.methods.DNA_stretch_positions(df, char, min_len, max_len, name_col='name', seq_col='sequence')
Finds all character stretches of given length in a string (i.e. DNA sequence) and reports their positions. Wrapper for DNA_string_positions()
- Parameters:
df (pandas.DataFrame) – Input dataframe
char (str) – Query character (base)
min_len (int) – minimum length of stretch to look for
min_len – maximum length of stretch to look for
name_col (str, optional) – name of column containing sequence names, defaults to ‘name’
seq_col (str, optional) – name of column containing sequences, defaults to ‘sequence’
- Returns:
Dataframe in long format containing positions of found string
- Return type:
pandas.DataFrame
- trxtools.methods.DNA_string_positions(df, string, name_col='name', seq_col='sequence')
Finds all occurences of a given substring in a string (i.e. DNA sequence) and reports their positions
- Parameters:
df (pandas.DataFrame) – Input dataframe
string (str) – Query string
name_col (str, optional) – name of column containing sequence names, defaults to ‘name’
seq_col (str, optional) – name of column containing sequences, defaults to ‘sequence’
- Returns:
Dataframe in long format containing positionsof found string
- Return type:
pandas.DataFrame
- trxtools.methods.addCluster(df=Empty DataFrame Columns: [] Index: [], n=10)
Assigns n clusters to the data using KMeans algorithm
- Parameters:
df (pandas.DataFrame) – DataFrame containing the data to be clustered
n (int, optional) – Number of clusters to form, defaults to 10
- Returns:
DataFrame with an additional ‘cluster’ column indicating cluster assignment
- Return type:
pandas.DataFrame
- Example:
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [5, 4, 3, 2, 1]}) >>> addCluster(df, n=2) x y cluster 0 1 5 1 1 2 4 1 2 3 3 0 3 4 2 0 4 5 1 0
- trxtools.methods.bashCommand(bashCommand='')
Run command in bash using subprocess.call()
- Parameters:
bashCommand – str with command
- Returns:
None
- trxtools.methods.bed2len(bed=Empty DataFrame Columns: [] Index: [])
Convert a bed DataFrame to a length DataFrame.
- Parameters:
bed (pandas.DataFrame) – pandas DataFrame containing bed data.
- Returns:
pandas DataFrame with length data.
- Return type:
pandas.DataFrame
- Example:
>>> bed = pd.DataFrame({'chr': ['chr1', 'chr1'], 'start': [100, 200], 'stop': [150, 250], 'region': ['region1', 'region2']}) >>> bed2len(bed) region region1 50 region2 50 dtype: int64
- trxtools.methods.calGC(dataset=Empty DataFrame Columns: [] Index: [], calFor=['G', 'C'])
Returns GC content in a given string - uses [‘nucleotide’] column
- Parameters:
dataset (pandas.DataFrame) – DataFrame() with “nucleotide” column
calFor (list) – list of nucleotides to calculate GC content
- Returns:
fraction of GC content, float
- Return type:
float
- trxtools.methods.cleanNames(data, strings=[])
Cleans the names in the given data by removing specified strings.
- Parameters:
data (dict or pandas.DataFrame) – The data to be cleaned. It can be either a dictionary or a pandas DataFrame.
strings (list, optional) – A list of strings to be removed from the names. Defaults to an empty list.
- Returns:
The cleaned data with names modified according to the specified strings.
- Return type:
dict or pandas.DataFrame
- trxtools.methods.define_experiments(paths_in, whole_name=False, strip='_hittable_reads.txt')
Parse file names and extract experiment name from them
- Parameters:
paths_in (list) – List of file paths
whole_name (bool, optional) – Whether to use the whole file name as the experiment name, defaults to False
strip (str, optional) – String to strip from the file name, defaults to ‘_hittable_reads.txt’
- Returns:
List of experiment names and list of paths
- Return type:
tuple
- trxtools.methods.enriched(df, padj=0.05, fc=2)
Filter DataFrame for enriched genes based on adjusted p-value and log2 fold change.
- Parameters:
df (pandas.DataFrame) – DataFrame containing gene expression data
padj (float, optional) – Adjusted p-value threshold, defaults to 0.05
fc (float, optional) – Log2 fold change threshold, defaults to 2
- Returns:
Filtered DataFrame with enriched genes
- Return type:
pandas.DataFrame
- trxtools.methods.expNameParser(name, additional_tags=[], order='b_d_e_p')
Function handles experiment name; recognizes AB123456 as experiment date; BY4741 or HTP or given string as bait protein
- Parameters:
name (str) – Experiment name
additional_tags (list) – List of additional tags
order (str, optional) – Order of elements in the output, defaults to ‘b_d_e_p’
- Returns:
Reordered experiment name
- Return type:
str
- trxtools.methods.expStats(input_df=Empty DataFrame Columns: [] Index: [], smooth=True, window=10, win_type='blackman')
Returns DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments
- Parameters:
input_df (pandas.DataFrame) – DataFrame containing the input data
smooth (bool, optional) – Whether to apply smoothing window, defaults to True
window (int, optional) – Smoothing window size, defaults to 10
win_type (str, optional) – Type of smoothing window, defaults to ‘blackman’
- Returns:
DataFrame with calculated statistics
- Return type:
pandas.DataFrame
- trxtools.methods.filterExp(datasets, let_in=[''], let_out=['wont_find_this_string'], verbose=False)
Returns object with filtered columns/keys.
- Parameters:
datasets – DataFrame or dict with exp name as a key
let_in (list) – List of elements of name to filter in
let_out (list) – List of elements of name to filter out
verbose (bool, optional) – If True, prints the filtered columns/keys, defaults to False
- Returns:
DataFrame or dict with filtered columns/keys
- Return type:
pandas.DataFrame or dict
- trxtools.methods.find_pol3_terminators(df, min_T, max_T, name_col='name', seq_col='sequence')
Finds all Pol3 terminators (T stretches) of given length in a string (i.e. DNA sequence) and reports their positions. Wrapper for DNA_string_positions()
- Parameters:
df (pandas.DataFrame) – Input dataframe
min_T (int) – minimum length of T stretch to look for
min_T – maximum length of T stretch to look for
region_col (str, optional) – name of column containing sequence names, defaults to ‘region’
seq_col (str, optional) – name of column containing sequences, defaults to ‘sequence’
- Returns:
Dataframe in long format containing positionsof found string
- Return type:
pandas.DataFrame
- trxtools.methods.groupCRACsamples(df=<class 'pandas.core.frame.DataFrame'>, use='protein', toDrop=[])
Parse CRAC names and annotates them using one of the following features: [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’, ‘sampleRep’]
- Parameters:
df (pandas.DataFrame) – DataFrame with CRAC names as index
use (str, optional) – Feature to annotate, choose from [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’, ‘sample’, ‘sampleRep’], defaults to ‘protein’
toDrop (list, optional) – List of words in CRAC name that will qualify the sample for rejection, defaults to []
- Returns:
DataFrame with added column [‘group’]
- Return type:
pandas.DataFrame
- trxtools.methods.indexOrder(df=Empty DataFrame Columns: [] Index: [], additional_tags=[], output='root', order='b_d_e_p')
Apply expNameParser to whole DataFrame
- Parameters:
df (pandas.DataFrame) – DataFrame where names of columns are names of experiments
additional_tags (list) – List of additional tags to consider in expNameParser
output (str) – Not used in the function, kept for compatibility
order (str, optional) – Order of elements in the output, defaults to ‘b_d_e_p’
- Returns:
DataFrame with new names
- Return type:
pandas.DataFrame
- trxtools.methods.is_inside(inner_start, inner_end, outer_start, outer_end)
Check if one region is inside another string.
- Parameters:
inner_start (int) – Start position of the inner region
inner_end (int) – End position of the inner region
outer_start (int) – Start position of the outer region
outer_end (int) – End position of the outer region
- Returns:
True if the inner region is inside the outer region, False otherwise
- Return type:
bool
- Example:
>>> is_inside(1, 3, 0, 5) True >>> is_inside(1, 3, 0, 2) False
- trxtools.methods.letterContent(seq='', letter='A')
Calculates the content of a given letter in a sequence
- Parameters:
seq (str) – Sequence to calculate the content of the given letter in
letter (str) – letter to calculate the content of in the sequence, defaults to “A”
- Returns:
Content of the given letter in the sequence, round to two decimal places
- Return type:
float
- Example:
>>> letterContent("ATCG", "A") 0.25
- trxtools.methods.listPaths(folder='.', suffix='', nameElem=None)
List all file paths in a folder with a specific suffix and optional name element.
- Parameters:
folder (str, optional) – Folder to list files from, defaults to “.”
suffix (str, optional) – Suffix of the files to list, defaults to an empty string
nameElem (str, optional) – Optional element that must be in the file name, defaults to None
- Returns:
List of file paths that match the criteria
- Return type:
list
- Example:
>>> listPaths(folder=".", suffix=".txt", nameElem="file") ['file1.txt', 'file2.txt']
- trxtools.methods.loadGTF(gtf_path='')
Load GTF file into a DataFrame
- Parameters:
gtf_path (str, optional) – Path to the GTF file, defaults to “”
- Returns:
DataFrame containing GTF data
- Return type:
pandas.DataFrame
- trxtools.methods.nested_region_cleanup(df, start_col='start', end_col='end')
Helper function to remove shorter regions nested in longer ones
- Parameters:
df (pandas.DataFrame) – input dataframe
start_col (str, optional) – name of column containing region start positions, defaults to ‘start’
end_col (str, optional) – name of column containing region end positions, defaults to ‘end’
length_col (str, optional) – name of column containing region lengths, defaults to ‘length’
- Returns:
dataframe with nested regions removed
- Return type:
pandas.DataFrame
- trxtools.methods.normalize(df=<class 'pandas.core.frame.DataFrame'>, log2=False, pseudocounts=0.1, CPM=True)
Normalize the DataFrame by adding pseudocounts, dividing by the sum, and optionally applying log2 transformation and CPM scaling.
- Parameters:
df (pandas.DataFrame) – DataFrame to be normalized
log2 (bool, optional) – Whether to apply log2 transformation, defaults to False
pseudocounts (float, optional) – Value to add to each element to avoid log of zero, defaults to 0.1
CPM (bool, optional) – Whether to scale the data to counts per million, defaults to True
- Returns:
Normalized DataFrame
- Return type:
pandas.DataFrame
- trxtools.methods.parseCRACname(s1=<class 'pandas.core.series.Series'>)
- Parse CRAC name into [‘expID’, ‘expDate’, ‘protein’, ‘condition1’, ‘condition2’, ‘condition3’] using this order.
“_” is used to split the name
- Parameters:
s1 (pandas.Series) – Series containing CRAC names
- Returns:
DataFrame with parsed CRAC names
- Return type:
pandas.DataFrame
- trxtools.methods.quantileCategory(s1=Series([], dtype: float64), q=4)
Quantile-based discretization function based on pandas.qcut function.
- Parameters:
s1 (pandas.Series) – Series to be discretized
q (int, optional) – Number of quantiles, defaults to 4
- Returns:
Series with quantile categories
- Return type:
pandas.Series
- Example:
>>> quantileCategory(pd.Series([1, 2, 3, 4, 5]), q=2) 0 0 1 0 2 1 3 1 4 1 dtype: int64
- trxtools.methods.randomDNAall(length=0, letters='CGTA')
Generates all possible random sequences of a given length
- Parameters:
length (int) – Length of the sequence
letters (str) – str with letters that will be used
- Returns:
List of all possible random sequences of the given length
- Return type:
list
- Example:
>>> randomDNAall(2,"CGTA") ['CC', 'CG', 'CT', 'CA', 'GC', 'GG', 'GT', 'GA', 'TC', 'TG', 'TT', 'TA', 'AC', 'AG', 'AT', 'AA']
- trxtools.methods.randomDNAsingle(length=0, letters='CGTA')
Random generator of nucleotide sequence
- Parameters:
length (int) – Length of the sequence
letters (str) – str with letters that will be used
- Returns:
Random sequence of the given length
- Return type:
str
- Example:
>>> randomDNAsingle(10,"CGTA") 'CGTACGTAAC'
- trxtools.methods.readSalmon(nameElem='', path='', toLoad='', toClear=[], toAdd='', column='NumReads', df=None, overwrite=False)
Reads multiple Salmon quant.sf files to one DataFrame
- Parameters:
nameElem (str) – Element to load, present in all files
path (str) – Path to directory with files
toLoad (str, optional) – Additional parameter for filtering, defaults to nameElem
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
column (str, optional) – Column to extract from quant.sf file, defaults to ‘NumReads’
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False
- Returns:
DataFrame with all files, where columns are values from ‘Name’ column
- Return type:
pandas.DataFrame
- trxtools.methods.read_DEseq(p)
WARNING: Not tested properly. May not work as expected. Read DESeq2 output file and add gene names.
- Parameters:
p (str) – Path to the DESeq2 output file
- Returns:
DataFrame with DESeq2 results and gene names
- Return type:
pandas.DataFrame
- trxtools.methods.read_HTSeq_output(path='', toLoad='classes', toClear=[], toAdd='', df=None, overwrite=False)
Reads multiple HTSeq tab files to one DataFrame
- Parameters:
path (str) – Path to directory with files
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False
- Returns:
DataFrame with all files, where columns are values from ‘name’ column
- Return type:
pandas.DataFrame
- trxtools.methods.read_STARstats(path='', toClear=[], toAdd='', df=None, overwrite=False)
Reads multiple STAR Log final output files to one DataFrame
- Parameters:
path (str) – Path to directory with files
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False
- Returns:
DataFrame with all files, where columns are values from ‘name’ column
- Return type:
pandas.DataFrame
- trxtools.methods.read_featureCount(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)
Read featureCount files with common first column
- Parameters:
nameElem (str) – Element to load, present in all files
path (str) – Path to directory with files
toLoad (str, optional) – Additional parameter for filtering, defaults to nameElem
toClear (list) – List of strings to be removed from file names
toAdd (str) – String to be added to file names
df (pandas.DataFrame, optional) – DataFrame to be appended; default=None
overwrite (bool) – Boolean, allows for overwriting during appending, default=False
- Returns:
DataFrame with all files, where columns are values from ‘name’ column
- Return type:
pandas.DataFrame
- trxtools.methods.read_list(filepath='')
Read list from file. Each row becomes item in the list.
- Parameters:
filepath (str) – Path to the file
- Returns:
List of items from the file
- Return type:
list
- Example:
>>> read_list("file.txt") ['item1', 'item2']
- trxtools.methods.read_tabFile(nameElem='', path='', toLoad='', toClear=[], toAdd='', df=None, overwrite=False)
Read tab files with common first column
- Parameters:
nameElem (str) – part of a filename which is present in all files
path (str) – path to directory with files
toLoad (str) – to be present in file name
toClear (str) – part of filename to be removed from file name
toAdd (str) – string to be added to file name
df (pandas.DataFrame) – DataFrame, to be appended; default=None
overwrite (bool) – boolean, allows for overwriting during appending, default = False
- Returns:
dataframe with all files, where columns are values from ‘name’ column
- Return type:
pandas.DataFrame
- trxtools.methods.reverse_complement(seq)
Reverse complement of DNA or RNA sequence. Identifies RNA by presence of ‘U’ in the sequence.
- Parameters:
seq (str) – DNA or RNA sequence to reverse complement
- Returns:
Reverse complement of the input sequence
- Return type:
str
- Example:
>>> reverse_complement("AUCG") 'CGAU'
- trxtools.methods.reverse_complement_DNA(seq)
Reverse complement of DNA sequence
- Parameters:
seq (str) – DNA sequence to reverse complement
- Returns:
Reverse complement of the input sequence
- Return type:
str
- Example:
>>> reverse_complement_DNA("ATCG") 'CGAT'
- trxtools.methods.reverse_complement_RNA(seq)
Reverse complement of RNA sequence
- Parameters:
seq (str) – RNA sequence to reverse complement
- Returns:
Reverse complement of the input sequence
- Return type:
str
- Example:
>>> reverse_complement_RNA("AUCG") 'CGAU'
- trxtools.methods.rollingGC(s=<class 'pandas.core.series.Series'>, window=10)
Calculates GC from sequence, uses ‘boxcar’ window
- Parameters:
s (pandas.Series) – Series containing sequence
window (int) – window size for GC calculation
- Returns:
Series with GC calculated, center=False
- Return type:
pandas.Series
- trxtools.methods.runPCA(data=Empty DataFrame Columns: [] Index: [], n_components=2)
Run PCA analysis and re-assigns column names and index names
- Parameters:
data (pandas.DataFrame) – DataFrame containing the input data
n_components (int, optional) – Number of principal components to compute, defaults to 2
- Returns:
Tuple consisting of DataFrame with PCA results and a list of explained variance ratios
- Return type:
tuple
- Example:
>>> runPCA(pd.DataFrame([[1, 2], [3, 4], [5, 6]]), n_components=2) ( PC1 PC2 0 -4.242641 0.000000 1 0.000000 0.000000 2 4.242641 0.000000, [100.0, 0.0])
- trxtools.methods.timestamp()
Returns current timestamp as a string
- Returns:
timestamp in a format ‘YYYYMMDD_HHMMSS’
- Return type:
str
- trxtools.methods.timestampRandomInt()
Returns current timestamp with a random integer as a string
- Returns:
timestamp in a format ‘YYYYMMDD_HHMMSS_RANDOMINT’
- Return type:
str
trxtools.nascent module
- class trxtools.nascent.Fold(tempDir=None)
Bases:
object
- RNAfold(data, saveData=False, temp=None)
Calculates dG using RNAfold (ViennaRNA).
- Parameters:
data ({list, pd.Series, pd.DataFrame}) – Input data, can be a list, Series, or DataFrame with “name” column.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for folding, default is None.
- Returns:
DataFrame with calculated dG values.
- Return type:
pd.DataFrame
- RNAinvert(structure='', saveData=False, temp=None, n=5, RNAprimer='', stall='', quick=False)
Returns n sequences with a given structure.
- Parameters:
structure (str) – Secondary RNA structure.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for folding, default is None.
n (int) – Number of output sequences, default is 5.
RNAprimer (str) – RNA primer sequence.
stall (str) – Nucleotide sequence to avoid.
quick (bool) – If False, uses -Fmp -f 0.01 params, default is False.
- Returns:
DataFrame with generated sequences.
- Return type:
pd.DataFrame
- RNAinvertStall(structure='', RNAprimer='', stall='A', n=200)
Returns sequences without the nucleotide that is present in stall.
- Parameters:
structure (str) – Secondary RNA structure.
RNAprimer (str) – RNA primer sequence.
stall (str) – Nucleotide sequence to avoid, default is “A”.
n (int) – Number of sequences to generate, default is 200.
- Returns:
DataFrame with filtered sequences.
- Return type:
pd.DataFrame
- UNAfold(data, saveData=False, temp=None)
Calculates dG using UNAfold.
- Parameters:
data ({list, pd.Series, pd.DataFrame}) – Input data, can be a list, Series, or DataFrame with “name” column.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for folding, default is None.
- Returns:
DataFrame with calculated dG values.
- Return type:
pd.DataFrame
- bashFolding(method='RNA')
Runs RNA folding using bash.
- Parameters:
method (str) – Folding method, “RNA” for ViennaRNA or “UNA” for UNAfold.
- Returns:
None
- class trxtools.nascent.Hybrid(tempDir=None)
Bases:
object
- RNAhybrid(data, saveData=False, temp=None)
Calculates dG using hybrid-min. This function calculates the hybridization energy (dG) between RNA sequences using the hybrid-min tool.
- Parameters:
data ({list, pd.Series, pd.DataFrame}) – Input data, can be a list, Series, or DataFrame with “name” column.
saveData (bool) – If True, saves the data, default is False.
temp (int) – Temperature for hybridization, default is None.
- Returns:
DataFrame with calculated dG values.
- Return type:
pd.DataFrame
- Example:
>>> hybrid = Hybrid() >>> data = ["AGCUAGUCA", "CGAUCGUAG"] >>> hybrid.RNAhybrid(data)
- bashHybrid()
Runs hybrid-min using bash. This function executes the hybrid-min command using bash to calculate the hybridization energy between two sequences.
- Returns:
None
- trxtools.nascent.extendingWindow(sequence='', name='name', strand='plus', temp=30, m=7, toAdd=0)
Returns DataFrame of sequences of all possible lengths between minimum (m) and length of input sequence -1.
- Parameters:
sequence (str) – str, input sequence
name (str) – name of the sequence, default “name”
strand (str) – strand type {“plus”, “minus”}, default “plus”
temp (int) – temperature, default 30
m (int) – minimum length of sequence, default 7
toAdd (int) – additional length to add, default 0
- Returns:
DataFrame of sequences with names according to nascent.slidingWindow convention
- Return type:
pd.DataFrame
- trxtools.nascent.foldNascentElem(data=Empty DataFrame Columns: [] Index: [])
Fold the very 3’ of nascent elements.
- Parameters:
data (pd.DataFrame) – DataFrame containing folded sequences.
- Returns:
DataFrame with additional columns for nascent element folding energies.
- Return type:
pd.DataFrame
- trxtools.nascent.handleInput(data, keepNames=True)
Input data with columns: “seq” or “sequence” and “name” (optional).
- Parameters:
data ({str, list, pd.Series, pd.DataFrame}) – input data {str, list, Series, DataFrame}
keepNames (bool) – if True use given names, default True
- Returns:
DataFrame where index become name of sequence to fold
- Return type:
pd.DataFrame
- trxtools.nascent.join2d(df=Empty DataFrame Columns: [] Index: [], use='format')
Joins 2D data from RNA folding results.
- Parameters:
df (pd.DataFrame) – DataFrame containing RNA folding results.
use (str) – Column to use for joining, default is ‘format’.
- Returns:
Dictionary of DataFrames, one for each gene/chromosome.
- Return type:
dict
- trxtools.nascent.merge2d(df=<class 'pandas.core.frame.DataFrame'>)
Merges 2D data from RNA folding results.
- Parameters:
df (pd.DataFrame) – DataFrame containing RNA folding results.
- Returns:
DataFrame with statistical description of the merged data.
- Return type:
pd.DataFrame
- trxtools.nascent.name2index(s1=Series([], dtype: object))
Extracts position from sequence name.
- Parameters:
s1 (pd.Series) – Series with names from prepareNascent function
- Returns:
Series with positions
- Return type:
pd.Series
- trxtools.nascent.nascentElems(vienna='', sequence='')
Describe elements of secondary structure: stems, multistems.
- Parameters:
vienna (str) – Vienna RNA secondary structure notation.
sequence (str) – RNA sequence.
- Returns:
Tuple containing the sequence of the last nascent element and its distance from the end.
- Return type:
tuple
- trxtools.nascent.nascentElemsDataFrame(data=Empty DataFrame Columns: [] Index: [])
Apply nascentElems to DataFrame of folded sequences.
- Parameters:
data (pd.DataFrame) – DataFrame containing folded sequences with ‘vienna’ and ‘sequence’ columns.
- Returns:
DataFrame with additional columns for nascent element sequences and distances.
- Return type:
pd.DataFrame
- trxtools.nascent.nascentFolding(sequence='', temp=30, window=100)
Combines folding function: fold RNA, locate last nascent element and calculate dG of it.
- Parameters:
sequence (str) – RNA sequence.
temp (int) – Folding temperature, default is 30.
window (int) – Window size for folding, default is 100.
- Returns:
DataFrame with folding results and nascent element information.
- Return type:
pd.DataFrame
- trxtools.nascent.parseFoldingName(df=Empty DataFrame Columns: [] Index: [])
Parse folding names into separate columns.
- Parameters:
df (pd.DataFrame) – DataFrame containing folding names.
- Returns:
DataFrame with parsed columns.
- Return type:
pd.DataFrame
- trxtools.nascent.prepareNascent(sequence='', name='name', strand='plus', temp=30, window=100)
Divide long transcript into short sequences. Combines output of extendingWindow and slidingWindow.
- Parameters:
sequence (str) – input sequence
name (str) – name of the sequence, default “name”
strand (str) – strand type {“plus”, “minus”}, default “plus”
temp (int) – temperature, default 30
window (int) – window size, default 100
- Returns:
DataFrame with sequences
- Return type:
pd.DataFrame
- trxtools.nascent.slidingWindow(sequence='', name='name', strand='plus', temp=30, window=100)
Slices sequence using sliding window.
- Parameters:
sequence (str) – input sequence
name (str) – name of the sequence, default “name”
strand (str) – strand type {“plus”, “both”, “minus”}, default “plus”
temp (int) – temperature, default 30
window (int) – window size, default 100
- Returns:
DataFrame with sliding windows
- Return type:
pd.DataFrame
trxtools.plotting module
- trxtools.plotting.GOterm(df, x='fdr', y='term.label', normalizer='pValue', cutoff=0.05, count=20, figsize=(4, 4), dpi=300, title=None, fname=None)
Generate a bar plot for GO terms based on significance.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing GO term data.
x (str, optional) – The column name for the x-axis values (e.g., FDR). Defaults to ‘fdr’.
y (str, optional) – The column name for the y-axis values (e.g., term label). Defaults to ‘term.label’.
normalizer (str, optional) – The column name used for normalization (e.g., p-value). Defaults to ‘pValue’.
cutoff (float, optional) – The cutoff value for significance. Defaults to 0.05.
count (int, optional) – The number of top terms to display. Defaults to 20.
figsize (tuple, optional) – The figure size. Defaults to (4, 4).
dpi (int, optional) – The resolution of the saved figure. Defaults to 300.
title (str, optional) – The title of the plot. Defaults to None.
fname (str, optional) – The file path to save the figure. Defaults to None.
- Returns:
None
- Example:
>>> GOterm(df, x='fdr', y='term.label', normalizer='pValue', cutoff=0.05, count=20, title='GO Term Plot')
- trxtools.plotting.PCA(data=Empty DataFrame Columns: [] Index: [], names=[], title='', PClimit=1, figsize=(7, 7), PCval=[])
Plot PCA plot.
- Parameters:
data (pd.DataFrame) – DataFrame containing PCA results.
names (list, optional) – List of names to annotate.
title (str, optional) – Title of the plot.
PClimit (int, optional) – Number of principal components to plot. Defaults to 1.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 7).
PCval (list, optional) – List of explained variance for each principal component. Defaults to an empty list.
- Returns:
None
- Example:
>>> PCA(data, names=['Sample1', 'Sample2'], title='PCA Plot', PClimit=2)
- trxtools.plotting.boxplot1(data, labels=None, title='', figsize=(7, 6), dpi=150, log=False, lim=None, name_cmap='Spectral', vert=1, color=None, grid=False, fname=None)
Generate a box plot.
- Parameters:
data (list or pd.DataFrame or pd.Series) – Data to plot.
labels (list, optional) – Labels for the data. Defaults to None.
title (str, optional) – Title of the plot. Defaults to an empty string.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 6).
dpi (int, optional) – Resolution of the figure. Defaults to 150.
log (bool, optional) – Whether to use a logarithmic scale. Defaults to False.
lim (tuple, optional) – Limits for the y-axis. Defaults to None.
name_cmap (str, optional) – Name of the colormap to use. Defaults to ‘Spectral’.
vert (int, optional) – Whether to plot the boxes vertically. Defaults to 1.
color (str, optional) – Color of the boxes. Defaults to None.
grid (bool, optional) – Whether to display a grid. Defaults to False.
fname (str, optional) – File path to save the figure. Defaults to None.
- Returns:
None
- Example:
>>> boxplot1(data, labels=['A', 'B', 'C'], title='Box Plot', log=True)
- trxtools.plotting.clusterClusterMap(df)
Generate a clustermap for clusters.
- Parameters:
df (pd.DataFrame) – DataFrame containing cluster data.
- Returns:
None
- Example:
>>> clusterClusterMap(df)
- trxtools.plotting.cumulativeDifference(ref, df2=Empty DataFrame Columns: [] Index: [], local_pos=[], dpi=150, fill_between=True, title='', start=None, stop=None, window=50, figsize=(4, 3), equal_weight=False, color1='green', lc='red', fname=None, use='mean', ylim=None)
Plot cumulative differences for peaks metaplot.
- Parameters:
ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
df2 (pd.DataFrame) – DataFrame containing the second dataset.
local_pos (list) – List of positions of features (peaks/troughs).
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
fill_between (bool, optional) – Whether to fill the area between the plot and the x-axis. Defaults to True.
title (str, optional) – Title of the plot. Defaults to an empty string.
start (int, optional) – Start position for filtering local positions. Defaults to None.
stop (int, optional) – Stop position for filtering local positions. Defaults to None.
window (int, optional) – Window size around each feature. Defaults to 50.
figsize (tuple, optional) – Size of the figure. Defaults to (4, 3).
equal_weight (bool, optional) – Whether to normalize the data to equal weight. Defaults to False.
color1 (str, optional) – Color for the difference plot. Defaults to ‘green’.
lc (str, optional) – Color for the vertical line at position 0. Defaults to ‘red’.
fname (str, optional) – File path to save the figure. Defaults to None.
use (str, optional) – Method to aggregate data (‘mean’, ‘median’, ‘sum’). Defaults to ‘mean’.
ylim (tuple, optional) – Limits for the y-axis. Defaults to None.
- Returns:
None
- Example:
>>> cumulativeDifference('reference.csv', df2, local_pos=[100, 200, 300], title='Cumulative Differences')
- trxtools.plotting.cumulativePeaks(ref, df2=Empty DataFrame Columns: [] Index: [], local_pos=[], dpi=150, title='', start=None, stop=None, window=50, figsize=(4, 3), equal_weight=False, color1='green', color2='magenta', lc='red', fname=None, use='mean', ylim=None)
Plot cumulative peaks metaplot for a single gene.
- Parameters:
ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
df2 (pd.DataFrame) – DataFrame containing the second dataset.
local_pos (list) – List of positions of features (peaks/troughs).
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
title (str, optional) – Title of the plot. Defaults to an empty string.
start (int, optional) – Start position for filtering local positions. Defaults to None.
stop (int, optional) – Stop position for filtering local positions. Defaults to None.
window (int, optional) – Window size around each feature. Defaults to 50.
figsize (tuple, optional) – Size of the figure. Defaults to (4, 3).
equal_weight (bool, optional) – Whether to normalize the data to equal weight. Defaults to False.
color1 (str, optional) – Color for the reference dataset plot. Defaults to ‘green’.
color2 (str, optional) – Color for the second dataset plot. Defaults to ‘magenta’.
lc (str, optional) – Color for the vertical line at position 0. Defaults to ‘red’.
fname (str, optional) – File path to save the figure. Defaults to None.
use (str, optional) – Method to aggregate data (‘mean’, ‘median’, ‘sum’). Defaults to ‘mean’.
ylim (tuple, optional) – Limits for the y-axis. Defaults to None.
- Returns:
None
- Example:
>>> cumulativePeaks('reference.csv', df2, local_pos=[100, 200, 300], title='Cumulative Peaks')
- trxtools.plotting.enhancedVolcano(df, x_fc='log2FoldChange', y_pval='padj', pval_cutoff=0.01, fc_cutoff=2, plot_n=True, labels=True, xlim=None, ylim=None, figsize=(4, 4), dpi=300, title=None, save=None)
Generate an enhanced volcano plot based on DataFrame values.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing the data.
x_fc (str, optional) – The column name for the x-axis values (fold change). Defaults to ‘log2FoldChange’.
y_pval (str, optional) – The column name for the y-axis values (adjusted p-value). Defaults to ‘padj’.
pval_cutoff (float, optional) – The p-value cutoff for significance. Defaults to 0.01.
fc_cutoff (float, optional) – The fold change cutoff for significance. Defaults to 2.
plot_n (bool, optional) – Whether to plot non-significant points. Defaults to True.
labels (bool, optional) – Whether to add labels to significant points. Defaults to True.
xlim (tuple, optional) – The x-axis limits. Defaults to None.
ylim (tuple, optional) – The y-axis limits. Defaults to None.
figsize (tuple, optional) – The figure size. Defaults to (4, 4).
dpi (int, optional) – The resolution of the saved figure. Defaults to 300.
title (str, optional) – The title of the plot. Defaults to None.
save (str, optional) – The file path to save the figure. Defaults to None.
- Returns:
None
- Example:
>>> enhancedVolcano(df, x_fc='log2FoldChange', y_pval='padj', pval_cutoff=0.01, fc_cutoff=2, title='Volcano Plot')
- trxtools.plotting.generateSubplots(dataframes, figsize=(5, 3), dpi=300, save=None)
Generate subplots for multiple dataframes.
- Parameters:
dataframes (list) – A list of dataframes to plot. Each element is a list of lists containing title, dataframe, and optional colormap.
figsize (tuple, optional) – The size of the figure. Defaults to (5, 3).
dpi (int, optional) – The resolution of the figure in dots per inch. Defaults to 300.
save (str, optional) – The file path to save the figure. If not provided, the figure will be displayed.
- Returns:
None
- Example:
>>> df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}) >>> df2 = pd.DataFrame({'x': [7, 8, 9], 'y': [10, 11, 12]}) >>> generateSubplots([ >>> [['Title 1', df1, None], >>> ['Title 2', df2, None]] >>> ], figsize=(10, 6), dpi=150, save='plot.png')
- trxtools.plotting.hplotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot the average input read length as a horizontal bar plot.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> hplotSTARstats_readLen(df)
- trxtools.plotting.hplotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot the number of reads after pre-processing as a horizontal bar plot.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> hplotSTARstats_reads(df)
- trxtools.plotting.metaprofileAndHeatmap(data_metaplot, data_heatmap, subsets={'title': None}, agg_type='mean', normalize_internal=True, differential_heatmap=True, figsize=(5, 3), dpi=300, save=None, bins=[50, 10, 50])
Generate metaprofile and heatmap plots for multiple subsets.
- Parameters:
data_metaplot (dict) – Data for metaprofile plot.
data_heatmap (dict) – Data for heatmap plot.
subsets (dict, optional) – Dictionary of subsets with titles as keys and subset data as values. Defaults to {“title”: None}.
agg_type (str, optional) – Aggregation type for metaprofile (‘mean’, ‘median’, ‘sum’). Defaults to ‘mean’.
normalize_internal (bool, optional) – Whether to normalize data internally. Defaults to True.
differential_heatmap (bool, optional) – Whether to plot differential heatmap. Defaults to True.
figsize (tuple, optional) – The size of the figure. Defaults to (5, 3).
dpi (int, optional) – The resolution of the figure in dots per inch. Defaults to 300.
save (str, optional) – The file path to save the figure. If not provided, the figure will be displayed.
bins (list, optional) – List of bin sizes for gene scheme. Defaults to [50, 10, 50].
- Returns:
None
- Example:
>>> metaprofileAndHeatmap(data_metaplot, data_heatmap, subsets={"Subset 1": subset1, "Subset 2": subset2}, >>> agg_type='mean', normalize_internal=True, differential_heatmap=True, >>> figsize=(10, 6), dpi=150, save='plot.png', bins=[50, 10, 50])
- trxtools.plotting.plotAndFolding(df=Empty DataFrame Columns: [] Index: [], dG=Series([], dtype: float64), title='', start=None, stop=None, legend=True, figsize=(7, 3), ylim=(None, 0.01), dpi=150, color='green', name='median', h_lines=[], lc='red', offset=0, fname=None)
Plot a figure similar to a box plot with additional delta G values.
- Parameters:
df (pd.DataFrame) – DataFrame containing the data with columns: [‘position’, ‘mean’, ‘median’, ‘std’] and optionally [‘nucleotide’, ‘q1’, ‘q3’, ‘max’, ‘min’].
dG (pd.Series) – Series containing delta G values.
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
legend (bool, optional) – Whether to display the legend. Defaults to True.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
color (str, optional) – Color for the median line. Defaults to ‘green’.
name (str, optional) – Name for the median line in the legend. Defaults to ‘median’.
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 0.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.
- Returns:
None
- Example:
>>> plotAndFolding(df, dG, title="Plot with Delta G", start=10, stop=90, color='blue', name='median', h_lines=[20, 40, 60], lc='black', offset=5, fname='plot.png')
- trxtools.plotting.plotSTARstats(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot all STAR statistics including reads, read length, mapping, mismatches, and chimeric reads.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> plotSTARstats(df)
- trxtools.plotting.plotSTARstats_chimeric(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot chimeric reads and splices.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> plotSTARstats_chimeric(df)
- trxtools.plotting.plotSTARstats_mapping(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot STAR mapping statistics.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> plotSTARstats_mapping(df)
- trxtools.plotting.plotSTARstats_mistmatches(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot mapping errors including mismatches, deletions, and insertions.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> plotSTARstats_mistmatches(df)
- trxtools.plotting.plotSTARstats_readLen(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot the average input read length.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> plotSTARstats_readLen(df)
- trxtools.plotting.plotSTARstats_reads(df=Empty DataFrame Columns: [] Index: [], dpi=150)
Plot the number of reads after pre-processing.
- Parameters:
df (pd.DataFrame) – DataFrame containing STAR statistics.
dpi (int, optional) – Resolution of the plot in dots per inch. Defaults to 150.
- Returns:
None
- Example:
>>> plotSTARstats_reads(df)
- trxtools.plotting.plot_as_box_plot(df=Empty DataFrame Columns: [] Index: [], title='', start=None, stop=None, name='median', figsize=(7, 3), ylim=(None, 0.01), dpi=150, color='green', h_lines=[], lc='red', offset=0, fname=None)
Plot a figure similar to a box plot: median, 2nd and 3rd quartiles, and min-max range.
- Parameters:
df (pd.DataFrame) – DataFrame containing the data with columns: [‘position’, ‘mean’, ‘median’, ‘std’] and optionally [‘nucleotide’, ‘q1’, ‘q3’, ‘max’, ‘min’].
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
name (str, optional) – Name for the median line in the legend. Defaults to ‘median’.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
color (str, optional) – Color for the median line. Defaults to ‘green’.
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 0.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.
- Returns:
None
- Example:
>>> df = pd.DataFrame({'position': range(100), 'median': np.random.rand(100), 'q1': np.random.rand(100), 'q3': np.random.rand(100), 'min': np.random.rand(100), 'max': np.random.rand(100)}) >>> plot_as_box_plot(df, title="Box Plot", start=10, stop=90, name='median', figsize=(10, 5), ylim=(0, 1), dpi=200, color='blue', h_lines=[20, 40, 60], lc='black', offset=5, fname='box_plot.png')
- trxtools.plotting.plot_diff(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm', label1='reference', label2='', title='', start=None, stop=None, plot_medians=True, plot_ranges=True, legend=True, dpi=150, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], lc='red', offset=0, fname=None)
Plot given dataset and reference, highlighting differences.
- Parameters:
ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
dataset (pd.DataFrame) – DataFrame containing the data to compare with columns: [‘position’, ‘mean’, ‘median’, ‘std’] and optionally [‘nucleotide’, ‘q1’, ‘q3’, ‘max’, ‘min’].
ranges (str, optional) – Type of range to highlight (‘mm’ for min-max or ‘qq’ for q1-q3). Defaults to ‘mm’.
label1 (str, optional) – Label for the reference dataset in the legend. Defaults to “reference”.
label2 (str, optional) – Label for the comparison dataset in the legend.
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
plot_medians (bool, optional) – Whether to plot medians. Defaults to True.
plot_ranges (bool, optional) – Whether to plot ranges. Defaults to True.
legend (bool, optional) – Whether to display the legend. Defaults to True.
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 0.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.
- Returns:
None
- Example:
>>> plot_diff('reference.csv', dataset, ranges='qq', label1='Reference', label2='Comparison', title='Difference Plot', start=10, stop=90, plot_medians=True, plot_ranges=True, h_lines=[20, 40, 60], lc='black', offset=5, fname='difference_plot.png')
- trxtools.plotting.plot_heatmap(df=Empty DataFrame Columns: [] Index: [], title='Heatmap of differences between dataset and reference plot for RDN37-1', vmin=None, vmax=None, figsize=(20, 10), dpi=300, fname=None)
Plot heatmap of differences between dataset and reference.
- Parameters:
df (pd.DataFrame) – DataFrame containing the differences.
title (str, optional) – Title of the plot. Defaults to ‘Heatmap of differences between dataset and reference plot for RDN37-1’.
vmin (float, optional) – Minimum value for the colormap. Defaults to None.
vmax (float, optional) – Maximum value for the colormap. Defaults to None.
figsize (tuple, optional) – Size of the figure. Defaults to (20, 10).
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 300.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.
- Returns:
None
- Example:
>>> plot_heatmap(df, title='Difference Heatmap', vmin=-1, vmax=1, figsize=(15, 8), dpi=200, fname='heatmap.png')
- trxtools.plotting.plot_to_compare(ref, df=Empty DataFrame Columns: [] Index: [], color1='green', color2='black', legend=True, ref_label='', label='', title='', start=None, stop=None, figsize=(7, 3), ylim=(None, 0.01), h_lines=[], lc='red', dpi=150, offset=300, fname=None)
Plot a figure to compare two datasets similar to a box plot.
- Parameters:
ref (str or pd.DataFrame) – Path to CSV file or DataFrame containing reference data.
df (pd.DataFrame) – DataFrame containing the data to compare.
color1 (str, optional) – Color for the reference dataset plot. Defaults to ‘green’.
color2 (str, optional) – Color for the comparison dataset plot. Defaults to ‘black’.
ref_label (str, optional) – Label for the reference dataset in the legend.
label (str, optional) – Label for the comparison dataset in the legend.
title (str, optional) – Title of the plot.
start (int, optional) – Start position for the plot.
stop (int, optional) – Stop position for the plot.
legend (bool, optional) – Whether to display the legend. Defaults to True.
figsize (tuple, optional) – Size of the figure. Defaults to (7, 3).
ylim (tuple, optional) – Limits for the y-axis. Defaults to (None, 0.01).
h_lines (list, optional) – List of horizontal lines to add to the plot.
lc (str, optional) – Color for the horizontal lines. Defaults to ‘red’.
dpi (int, optional) – Resolution of the figure in dots per inch. Defaults to 150.
offset (int, optional) – Number to offset position if 5’ flank was used. Defaults to 300.
fname (str, optional) – File path to save the figure. If not provided, the figure will be displayed.
- Returns:
None
- Example:
>>> plot_to_compare('reference.csv', df, color1='blue', color2='red', ref_label='Reference', label='Comparison', title='Comparison Plot', start=10, stop=90, h_lines=[20, 40, 60], lc='black', offset=5, fname='comparison_plot.png')
- trxtools.plotting.select_colors(n=0, name_cmap='Spectral')
Select a list of colors from a colormap.
- Parameters:
n (int, optional) – Number of colors to select, defaults to int()
name_cmap (str, optional) – Name of the colormap to use, defaults to ‘Spectral’
- Returns:
List of colors
- Return type:
list
- trxtools.plotting.vennDiagram(*dataframes, labels=['df1', 'df2', 'df3'], colors=('skyblue', 'lightgreen', 'lightpink'), title=None, save=None)
Generate a Venn diagram for 2 or 3 DataFrames.
- Parameters:
dataframes (list of pd.DataFrame) – Input DataFrames containing the data.
labels (list of str, optional) – Labels for the DataFrames. Defaults to [‘df1’,’df2’,’df3’].
colors (tuple of str, optional) – Colors for the Venn diagram. Defaults to (‘skyblue’, ‘lightgreen’, ‘lightpink’).
title (str, optional) – The title of the plot. Defaults to None.
save (str, optional) – The file path to save the figure. Defaults to None.
- Returns:
None
- Example:
>>> vennDiagram(df1, df2, labels=['Group 1', 'Group 2'], title='Venn Diagram')
trxtools.profiles module
- trxtools.profiles.FoldingFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, ranges=0, offset=15, fold='dG65nt@30C')
Pulls folding information from BigWig folding data for a given gene.
- Parameters:
gene_name (str) – str, name of the gene
gtf (pyCRAC.GTF2) – pyCRAC.GTF2 object with GTF and TAB files loaded
bwFWD (dict) – dict of pyBigWig objects for forward strand
bwREV (dict) – dict of pyBigWig objects for reverse strand
ranges (int) – int, flanks to be added for the gene, default 0
offset (int) – int, offset for folding data, default 15
fold (str) – str, name of output column, default “dG65nt@30C”
- Returns:
DataFrame with folding information
- Return type:
pd.DataFrame
- Example:
>>> FoldingFromBigWig('gene1', gtf_object, bwFWD_dict, bwREV_dict)
- trxtools.profiles.binCollect3(s1=Series([], dtype: float64), lengths=[500, 72, 500], bins=[50, 10, 50])
Collects and sums data from a series into bins of specified lengths.
- Parameters:
s1 (pd.Series) – Series containing the input data
lengths (list) – List of lengths for each bin
bins (list) – List of number of bins for each length
- Returns:
Series with the summed data from the bins
- Return type:
pd.Series
- Example:
>>> binCollect3(s1, lengths=[500,72,500], bins=[50, 10, 50])
- trxtools.profiles.calculateFDR(data=Series([], dtype: float64), iterations=100, target_FDR=0.05)
Calculates False Discovery Rate (FDR) for a given dataset.
- Parameters:
data (pd.Series) – Series containing the input data
iterations (int) – Number of iterations for random dataset generation, default 100
target_FDR (float) – Target False Discovery Rate, default 0.05
- Returns:
Series with FDR applied
- Return type:
pd.Series
- Example:
>>> calculateFDR(data, iterations=100, target_FDR=0.05)
- trxtools.profiles.compare1toRef(ref, dataset=Series([], dtype: float64), ranges='mm', heatmap=False, relative=False)
Takes Series and compare this with reference DataFrame.
- Parameters:
ref (str or pd.DataFrame) – Path to csv file or DataFrame
dataset (pd.Series) – Series containing the dataset to compare
ranges (str) – ‘mm’ for min-max or ‘qq’ for q1-q3, default ‘mm’
heatmap (bool) – If True, return Series of differences to plot heatmap, default False
relative (bool) – If True, recalculate differences according to the peak size, default False
- Returns:
DataFrame (heatmap=False) or Series (heatmap=True)
- Return type:
pd.DataFrame or pd.Series
- Example:
>>> compare1toRef(ref, dataset, ranges='mm', heatmap=False, relative=False)
- trxtools.profiles.compareMoretoRef(ref, dataset=Empty DataFrame Columns: [] Index: [], ranges='mm')
Compares a DataFrame created by filter_df with a reference DataFrame.
- Parameters:
ref (str or pd.DataFrame) – Path to csv file or DataFrame containing reference data
dataset (pd.DataFrame) – DataFrame containing the dataset to compare
ranges (str) – ‘mm’ for min-max or ‘qq’ for q1-q3, default ‘mm’
- Returns:
DataFrame with comparison results
- Return type:
pd.DataFrame
- Example:
>>> compareMoretoRef(ref, dataset, ranges='mm')
- trxtools.profiles.dictBigWig(files=[], path='', strands=True)
Preload BigWig files to memory using pyBigWig tools.
- Parameters:
files (list) – list of str, filenames
path (str) – str, path to the files
strands (bool) – bool, if True handle strand-specific data, default True
- Returns:
dict or tuple of dicts, BigWig objects
- Return type:
dict or tuple
- Example:
>>> dictBigWig(['sample1_fwd.bw', 'sample1_rev.bw'], '/path/to/files') ({'sample1': <pyBigWig object>}, {'sample1': <pyBigWig object>})
- trxtools.profiles.findPeaks(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)
Find local extrema using SciPy argrelextrema function.
- Parameters:
s1 (pd.Series) – Series data to localize peaks
window (int) – To smooth data before peak-calling, default 1 (no smoothing)
win_type (str) – Type of smoothing window, default “blackman”
order (int) – Minimal spacing between peaks, argrelextrema order parameter, default 20
- Returns:
List of peaks
- Return type:
list
- Example:
>>> findPeaks(s1, window=1, win_type='blackman', order=20)
- trxtools.profiles.findTroughs(s1=Series([], dtype: float64), window=1, win_type='blackman', order=20)
Find local minima using SciPy argrelextrema function.
- Parameters:
s1 (pd.Series) – Series data to localize troughs
window (int) – To smooth data before trough-calling, default 1 (no smoothing)
win_type (str) – Type of smoothing window, default “blackman”
order (int) – Minimal spacing between minima, argrelextrema order parameter, default 20
- Returns:
List of troughs
- Return type:
list
- Example:
>>> findTroughs(s1, window=1, win_type='blackman', order=20)
- trxtools.profiles.geneFromBigWig(gene_name, gtf, bwFWD={}, bwREV={}, toStrip='', ranges=0)
Pull genome coverage from BigWig data for a given gene.
- Parameters:
gene_name (str) – str, name of the gene
gtf (pyCRAC.GTF2) – pyCRAC.GTF2 object, GTF and TAB files loaded
bwFWD (dict) – dict, pyBigWig objects for forward strand
bwREV (dict) – dict, pyBigWig objects for reverse strand
toStrip (str) – str, name to be stripped
ranges (int) – int, flanks to be added for the gene, default 0
- Returns:
DataFrame, genome coverage data
- Return type:
pd.DataFrame
- Example:
>>> geneFromBigWig('gene1', gtf_object, bwFWD_dict, bwREV_dict)
- trxtools.profiles.ntotal(df=<class 'pandas.core.frame.DataFrame'>, drop=True)
Normalize data in DataFrame to fraction of total column.
- Parameters:
df (pd.DataFrame) – DataFrame to be normalized
drop (bool) – If True, drop ‘position’ and ‘nucleotide’ columns, default True
- Returns:
Normalized DataFrame
- Return type:
pd.DataFrame
- Example:
>>> ntotal(df, drop=True)
- trxtools.profiles.parseConcatFile(path, gtf, use='reads', RPM=False, ranges=1000)
Parse concat file to extract gene data.
- Parameters:
path (str) – str, path of the concat file
gtf (pyCRAC.GTF2) – pyCRAC.GTF2 object, GTF and TAB files loaded
use (str) – str, column name to use [‘reads’, ‘substitutions’, ‘deletions’], default “reads”
RPM (bool) – bool, if True use reads per million, default False
ranges (int) – int, flanks to be added for the gene, default 1000
- Returns:
dict, DataFrames using gene name as a key
- Return type:
dict
- Example:
>>> parseConcatFile('/path/to/concat/file', gtf_object)
Warning
This function is deprecated and will be removed in future versions.
- trxtools.profiles.preprocess(input_df=Empty DataFrame Columns: [] Index: [], let_in=[''], let_out=['wont_find_this_string'], stats=False, smooth=True, window=10, win_type='blackman', pseudocounts_param=True, ntotal_param=True)
Combines methods.filterExp and expStats. Returns DataFrame with chosen experiments, optionally apply smoothing and stats.
- Parameters:
input_df (pd.DataFrame) – DataFrame containing the input data
let_in (list) – List of words that characterize experiment, default [‘’]
let_out (list) – List of words that disqualify experiments, default [‘wont_find_this_string’]
stats (bool) – If True, return stats for all experiments, default False
smooth (bool) – If True, apply smoothing window, default True
window (int) – Smoothing window size, default 10
win_type (str) – Type of smoothing window, default “blackman”
pseudocounts_param (bool) – If True, add 0.01 pseudocounts, default True
ntotal_param (bool) – If True, apply ntotal normalization, default True
- Returns:
DataFrame with ‘mean’, ‘median’, ‘min’, ‘max’ and quartiles if more than 2 experiments
- Return type:
pd.DataFrame
- Example:
>>> preprocess(input_df, let_in=['exp1'], let_out=['exp2'], stats=True, smooth=True, window=10, win_type='blackman', pseudocounts_param=True, ntotal_param=True)
- trxtools.profiles.pseudocounts(df=<class 'pandas.core.frame.DataFrame'>, value=0.01, drop=True)
Add pseudocounts to data to avoid zero values.
- Parameters:
df (pd.DataFrame) – DataFrame to which pseudocounts will be added
value (float) – Value of the pseudocount to be added, default 0.01
drop (bool) – If True, drop ‘position’ and ‘nucleotide’ columns, default True
- Returns:
DataFrame with pseudocounts added
- Return type:
pd.DataFrame
- Example:
>>> pseudocounts(df, value=0.01, drop=True)
- trxtools.profiles.save_csv(data_ref=Empty DataFrame Columns: [] Index: [], datasets=Empty DataFrame Columns: [] Index: [], path=None)
Saves DataFrame to a CSV file.
- Parameters:
data_ref (pd.DataFrame) – DataFrame containing reference data with ‘position’ and ‘nucleotide’ columns
datasets (pd.DataFrame) – DataFrame containing experimental data
path (str, optional) – Path to save the CSV file, default None
- Returns:
DataFrame with combined reference and experimental data
- Return type:
pd.DataFrame
- Example:
>>> save_csv(data_ref, datasets, path='/path/to/save.csv')
- trxtools.profiles.stripBigWignames(files=[])
Strip “_rev.bw” and “_fwd.bw” from file names.
- Parameters:
files (list) – list of str, filenames
- Returns:
list of str, unique names
- Return type:
list
- Example:
>>> stripBigWignames(['sample1_fwd.bw', 'sample1_rev.bw']) ['sample1']
trxtools.secondary module
- trxtools.secondary.Lstem(vienna='')
Returns a list of positions where “(” is found using coordinates {1,inf}.
- Parameters:
vienna (str) – Vienna format secondary structure
- Returns:
List of positions with “(”
- Return type:
list
Example: >>> Lstem(“((..))”) [1, 2]
- trxtools.secondary.Rstem(vienna='')
Returns a list of positions where “)” is found using coordinates {1,inf}.
- Parameters:
vienna (str) – Vienna format secondary structure
- Returns:
List of positions with “)”
- Return type:
list
Example: >>> Rstem(“((..))”) [5, 6]
- trxtools.secondary.checkVienna(sequence='', vienna='')
Validates the integrity of a Vienna file by checking the sequence and structure lengths and the balance of parentheses.
- Parameters:
sequence (str) – RNA sequence
vienna (str) – Vienna format secondary structure
- Returns:
True if the Vienna file is valid, False otherwise
- Return type:
bool
- Example:
>>> checkVienna("GCAU", "((..))") False
- trxtools.secondary.loopStems(vienna='', sequence='', loopsList=None, testPrint=False)
Returns positions of stems of single hairpins and multiloop stems using coordinates {1:inf}.
- Parameters:
vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence (optional)
loopsList (list) – List of loop positions (optional)
testPrint (bool) – Boolean to print test output, default=False
- Returns:
List of stem positions and list of multistem positions
- Return type:
tuple
- Example:
>>> loopStems("((..))", "GCAU") ([(1, 6)], [])
- trxtools.secondary.loops(vienna='')
Returns the first positions outside the loop, e.g., “.((….)).” returns [(3,8)].
- Parameters:
vienna (str) – Vienna format secondary structure
- Returns:
List of tuples with loop positions
- Return type:
list
Example: >>> loops(“.((….)).”) [(3, 8)]
- trxtools.secondary.substructures(vienna='', sequence='')
Lists sub-structures of the given structure.
- Parameters:
vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence
- Returns:
Series of sub-structures
- Return type:
pd.Series
- Example:
>>> substructures("((..))", "GCAU") stem_1 GCAU dtype: object
- trxtools.secondary.test(vienna='', sequence='', loops=None, stems=None, multistems=None, linkers=None)
Prints the Vienna structure with given features marked.
- Parameters:
vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence (optional)
loops (list) – List of tuples with loop positions (optional)
stems (list) – List of tuples with stem positions (optional)
multistems (list) – List of multistem positions (optional)
linkers (list) – List of linker positions (optional)
- Returns:
None
Example: >>> test(“((..))”, “GCAU”, loops=[(3, 4)]) 1 ((..)) GCAU __O_O
- trxtools.secondary.vienna2format(vienna='', sequence='', loopsList=None, stemsList=None, multistemsList=None, testPrint=False)
Converts Vienna format to a custom format with letters: O - loop, S - stem, M - multiloop stem, and L - linker.
- Parameters:
vienna (str) – Vienna format secondary structure
sequence (str) – RNA sequence (optional)
loopsList (list) – List of loop positions (optional)
stemsList (list) – List of stem positions (optional)
multistemsList (list) – List of multistem positions (optional)
testPrint (bool) – Boolean to print test output, default=False
- Returns:
Custom format string
- Return type:
str
- Example:
>>> vienna2format("((..))", "GCAU") 'SSLLSS'