---
title: "rnaCrosslinkOO"
author: Jonathan Price
output:
  html_document:
    df_print: paged
    highlight: tango
    theme: cosmo
    toc: yes
    toc_float:
      collapsed: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{rnaCrosslinksOO}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(rmarkdown)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)


```


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# rnaCrosslinkOO

Please check out the [publication](https://doi.org/10.1093/bioinformatics/btae193) if you interesting in this tool and the [github](https://github.com/JLP-BioInf/rnaCrosslinkNF). 

# The COMRADES experiment
\
The COMRADES experimental protocol for the prediction of RNA structure in vivo was first published in 2018 (Ziv et al., 2019) where they predicted the structure of the Zika virus. The protocol has subsequently been use to predict the structure of SARS-CoV-2 (Ziv et al., 2020). Have a look to get an understanding of the protocol:

* COMRADES determines in vivo RNA structures and interactions. (2018). Omer Ziv, Marta Gabryelska, Aaron Lun, Luca Gebert. Jessica Sheu-Gruttadauria and Luke Meredith, Zhong-Yu Liu,  Chun Kit Kwok, Cheng-Feng Qin, Ian MacRae, Ian Goodfellow , John Marioni, Grzegorz Kudla, Eric Miska.  Nature Methods. Volume 15. https://doi.org/10.1038/s41592-018-0121-0   

* The Short- and Long-Range RNA-RNA Interactome of SARS-CoV-2. (2020). Omer Ziv, Jonathan Price, Lyudmila Shalamova, Tsveta Kamenova, Ian Goodfellow, Friedemann Weber, Eric A. Miska. Molecular Cell,
Volume 80
    https://doi.org/10.1016/j.molcel.2020.11.004
    

![Figure from Ziv et al., 2020. Virus-inoculated cells are crosslinked using clickable psoralen. Viral RNA is pulled down from the cell lysate using an array of biotinylated DNA probes, following digestion of the DNA probes and fragmentation of the RNA. Biotin is attached to crosslinked RNA duplexes via click chemistry, enabling pulling down crosslinked RNA using streptavidin beads. Half of the RNA duplexes are proximity-ligated, following reversal of the crosslinking to enable sequencing. The other half serves as a control, in which crosslink reversal proceeds the proximity ligation](rnaCrosslinkProtocol.jpg){width=100%}

After sequencing, short reads are produced similar to a spliced / chimeric RNA read but where one half of the read corresponds to one half of a structural RNA duplex and the other half of the reads corresponds to the other half of the structural RNA duplex. This package has been designed to analyse this data. The short reads need to be prepared in a specific way to be inputted into this package. 

There are other types of crosslinking data! 


- [Paris](https://www.nature.com/articles/s41467-021-22552-y)


---

# rnaCrosslink data pre-processing

## Nextflow pipeline 
\
Fastq files produced from the COMRADES experiment can be processed for input into rnaCrosslinkOO using the Nextflow pre-processing pipeline, to get more information visit [here](https://github.com/JLP-BioInf/rnaCrosslinkNF). The pipeline  takes the reads through trimming alignment, QC and the production of the files necessary for input to rnaCrosslinkOO. Crosslinking experiments often have different library preparation protocols therefore it is not necessary to follow the prescribed pre-processing pipeline. The only requirement is that the input files for rnaCrosslinkOO have the correct format detailed below.


## Nextflow pipeline output
\
The main output files are the files entitled *X_gapped.txt*. These are the input files for rnaCrosslinkOO. The columns of the output files are as follows:

1. Read Name
2. Read Sequence
3. NA - Blank for now
4. Side 1 transcript ID
5. Side 1 Position start in read sequence
6. Side 1 Position end in read sequence
7. Side 1 Coordinate start in transcript
8. Side 1 Coordinate end in transcript
9. NA - Blank for now
10. Side 2 transcript ID
11. Side 2 Position start in read sequence
12. Side 2 Position end in read sequence
13. Side 2 Coordinate start in transcript
14. Side 2 Coordinate end in transcript
15. NA - Blank for now


![](inputFileSchematic.jpg)


---

# Input for rnaCrosslink-OO
\
The main input files for rnaCrosslink-OO is a tab delimited text file containing the reads and mapping location on the transcriptome. This can be manually created if your library preparation protocol does not suit the pre-processing pipeline although the easiest way to obtain these files is to use the nextflow pipeline detailed above. There is test data that ships with the package, this contains data for the 18S rRNA and it's interactions with the 28S rRNA. However, full data-sets already published can be found here:[Un-enriched rRNA dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246412).

Pre-requisites:

1. Install the rnaCrosslinkOO package
2. Input files (nexflow, custom or downloaded)
3. Meta-data table
4. ID of the RNA of interest (from the transcript reference )
5. A fasta sequence of the RNA of interest  (from the transcript reference )
6. A set of interactions to compare to (optional)
6. Reactivities (optional)


---

## Load the Library

There is a development version available on [github](https://github.com/JLP-BioInf/rnaCrosslinkOO). Issue reporting and collaboration welcome. This version usually has development features not available on CRAN.

```{r, echo = T, results = 'hide',message=FALSE}
#install.packages("rnaCrosslinkOO")
# Load the rnaCrosslink-OO Library 
library(rnaCrosslinkOO)
```

The package relies on functions from these packages:

```{r, echo = T, results = 'hide',message=FALSE}
# Here are the other libraries on which rnaCrosslinkOO relies
#library(seqinr)
#library(GenomicRanges)
#library(ggplot2)
#library(reshape2)
#library(MASS)
#library(ggplot2)
#library(doParallel)
#library(igraph)
#library(R4RNA)
#library(RColorBrewer)
#library(heatmap3)
#library(mixtools)
#library(TopDom)
library(tidyverse)
#library(RRNA)
#library(ggrepel)
```


## Make the Sample table

The metadata table has 4 columns and the column names are specific and case-sensitive. 

+ **file**              -  The input file name and location of the sample input file
+ **group**             -  "c" or "s" denoting whether the sample is a control or not
+ **sample**            -  Sample number 
+ **sampleName**        -  A unique sample name 

\

```{r}
# Set up the sample table
sampleTableRow1 = c(system.file("extdata", 
                                "s1.txt", 
                                package="rnaCrosslinkOO"), "s", "1", "s1")
sampleTableRow2 = c(system.file("extdata", 
                                "c1.txt", 
                                package="rnaCrosslinkOO"), "c", "1", "c1")
sampleTable2 = rbind.data.frame(sampleTableRow1, sampleTableRow2)

# add the column names 
colnames(sampleTable2) = c("file", "group", "sample", "sampleName")

sampleTable2
```

\

## Choose RNA

The name of the RNA to analyse, this must be as it appears in the input files. 


```{r}
rna = c("ENSG000000XXXXX_NR003286-2_RN18S1_rRNA")
```

\


## Transript fasta

Fasta sequence(s) of the RNA(s) of interest, taken from the transcriptome reference fasta used for mapping. Load in using the read.fasta function from seqinr.

```{r}
path18SFata <- system.file("extdata", 
                           "18S.fasta", 
                           package="rnaCrosslinkOO")

rnaRefs = list()
rnaRefs[[rna]] = read.fasta(path18SFata)

```

\


## A set of interactions to compare to (optional)

This is optional but you can provide a table of interactions for the RNA to compare against. This can be useful when comparing different samples or to another predicted structure for the same RNA. The table should be a tsv with to columns (i and j) each row shows an interaction between nucleotide i and j for comparison.

```{r}
path18SFata <- system.file("extdata", 
                           "ribovision18S.txt", 
                           package="rnaCrosslinkOO")
known18S = read.table(path18SFata,
                      header = F)
```

\


## Reactivities (optional)

If you also have reactivities from a chemical probing experiment they can be included here as a 1 column table with one value for each position in the transcript. This feature is not ready.

```{r, eval = F}
pathShape <- system.file("extdata",
                         "reactivities.txt", 
                         package="rnaCrosslinkOO")
shape = read.table(pathShape,
                      header = F)
```

---


\


# Slow Start


The package has 3 main processes; clustering, cluster trimming and folding. The next sections take you through the usage of each of these main stages.

\

![rnacrosslinkOO](rnacrosslinkoo.jpg)
\

## rnaCrosslinkQC

Before you make the object you can check the read sizes of the duplexes and the transcripts that exist within the dataset using the `rnaCrosslinkQC` method. This is the first port of call for the analysis.
```{r, eval=FALSE}
rnaCrosslinkQC(sampleTable2,
               directory = ".", 
               topTranscripts = F)

```

Using the plot of read sizes you can choose the desired read size when creating the object. This affects the resolution of the data and the subsequent accuracy of the folding. The shorter the reads, the more accurate.It is highly recommended to choose a read size cutoff that is smaller than 60bp if you would like to do structural prediction.


Checking the output file `topTranscripts_all.txt` will allow you to see the top inter and intra RNA interactions in the dataset and grab the ID which you will need for creating the object. 


## Make the Object


The instance of the *rnaCrosslinkDataSet* object that is created stores the information from the experiment including raw and processed data for the dataset. 

### Slots / Attributes

\

##### Analysis stage 

The slots for processed and unprocessed data keep the data from each stage of the analysis, this allows the user to quickly access any part of the results. Checking the status of the object will allow you to see which stages of the analysis are present for each of the attributes.  
\

##### Meta-data

+   **rnas**                    -  The RNA ID for this instance
+   **rnaSize**                 -  The size in nuceloties of the RNA for this instance
+   **sampleTable**             -  The meta-data for the instance ( detailed above )

\

##### Raw-data and processed data 


+   **InputFiles**                -  Data in the original input file format (list)
                                InputFiles - rna ID - Analysis stage - Sample Name
+   **matrixList**              - Data in contact matrix format (list) (cells contain the number of reads assinged to those interacting coordinates)
                                InputFiles - rna ID - Analysis stage - Sample Name
+   **clusterGrangesList**      - Granges of clusters identified (list)
                                InputFiles - rna ID - Analysis stage - Sample Name
+   **clusterTableList**        - Data frame of clusters identified (list)
                                InputFiles - rna ID - Analysis stage - Sample Name
+   **clusterTableFolded**      - A table of all clusters with predicted structures included
+   **interactionTable**        - A table contraints predicted for folding
+   **viennaStructures**        - A list of predicted structures in vienna format       


\

### rnaCrosslinkDataSet - make object 


```{r}
# load the object
cds = rnaCrosslinkDataSet(rnas = rna,
                      sampleTable = sampleTable2,
                      subset = "all",
                      sample = "all")
# be aware there are extra options here
# including:
# subset - allows you to choose specific read sizes (this affects resolution and accuracy)
# sample - Choose the same ammount of reads for each sample (useful for comparisons)
```

\

## Access the data within the instance


You can check on major parts of the object and return slots and other information using the accessor methods

\

### General Acessors 

\

#### Check Status of Instance

In each slot the analysis stages available to you are shown next to the slot name. These can be used in getData to extract the data for that slot and analysis type. 

```{r}
# Check status of instance 
cds
```

\

#### Retreive RNA Size

```{r}
# Returns the size of the RNA
rnaSize(cds)
```

\

#### Retreive Meta-data

```{r}
# Returns the sample table 
sampleTable(cds)
```

\


#### Retreive Sample Groups

```{r}
# Returns indexes of the samples in the control and not control groups
group(cds)
```

\


#### Retreive Sample Names

```{r}
# Get the sample names of the instance
sampleNames(cds)
```

\


#### Retreive Data in original format 

It is more recommended to use getData for this purpose but sometime is is useful to grab all data in the InputFiles slot which contains all raw and processed data in the original input format from each analysis stage that has been performed. 

```{r, eval=FALSE,echo=TRUE}
# Return the InputFiles slot
InputFiles(cds)
```

\


#### Retreive Contact Matrices

It is more recommended to use getData for this purpose but sometime is is useful to grab all data in the matrixList slot which contains contact matrices from each analysis stage that has been performed. 

```{r, eval=FALSE, echo=TRUE}
# Return the matrixList slot 
matrixList(cds) 
```

\


### Get Data 

Get data is more generic method for retrieving data from the object and returns a list, the number of entries in the list is number of samples in the dataset and the list contain entries of the data type and analysis stage you select.


```{r}
data = getData(x    = cds,              # The object      
               data =  "InputFiles",       # The Type of data to return     
               type = "original")[[1]] # The stage of the analysis for the return data

head(data)
```


\

## Check Interactions 

The first step is to assess the species of RNA present in the dataset, the instance will probably contain inter-RNA interactions and intra-RNA interactions for many different RNAs. A number of tables showing the different RNAs / interactions and the ammount of reads assigned to each can be returned with the following methods:


\

### topTranscripts


```{r, eval = F}
# Returns the RNAs with highest number of assigned reads 
# regardless of whether it is an Inter or Intra - RNA interaction. 
topTranscripts(cds,
               2)  # The number of entries to return

```

\

### topInteractors


```{r, eval = F}
# Returns the RNAs that interact with the RNA of interest
topInteracters(cds, # The rnaCrosslinkDataSet instance
               1)   # The number of entries to return
```

\

### topInteractions


```{r, eval = F}
# Returns the Interacions with the highest number of assigned reads
topInteractions(cds, # The rnaCrosslinkDataSet instance
                2)   # The number of entries to return

```


\

### Get Count Matricies


```{r}
features = featureInfo(cds) # The rnaCrosslinkDataSet instance


# Counts for features at the transcript level
features$transcript

# Counts for features at the family level (last field with  "_" delimited IDs)
features$family

```

---


\


## Clustering 


In the rnaCrosslink data, crosslinking and fragmentation leads to the production of redundant structural information, where the same in vivo structure from different RNA molecules produces slightly different RNA fragments. Clustering of these duplexes that originate from the same place in the reference transcript reduces computational time and allows trimming of these clusters to improve the folding prediction.
To allow clustering, gapped alignments can be described by the transcript coordinates of the left (L) and right (R) side of the reads and by the nucleotides between L and R (g). Reads with similar or identical g values are likely to originate from the same structure of different molecules. In rnaCrosslink-OO, an adjacency matrix is created for all chimeric reads based on the nucleotide difference between their g values (Deltagap). This results in Deltagap = 0 for identically overlapping gaps and increasing Deltagap values for gapped reads with less overlap:

+ Deltagap(gi,gj) = max(width(gi),width(gj)) – widthofIntersection(gi,gj)

For short range interactions ( g <= 10 nt ) the weights are calculated such that the highest weights are given to exactly overlapping gapped alignments and a weight of 0 is assigned to alignments that do not overlap.

+ E(gi, gj) = 10 - Deltagap(gi, gj)

Long range interactions (g >10) are clustered separately and their weights are calculated as follows and edges with weights lower that 0 are set to 0. Meaning that gaps that do not overlap by at least 15 nucleotides are considered in different clusters. 

+ E(gi, gj) = 15 - Deltagap(gi, gj)

From these weights the network can be defined for short- and long-range interaction as: G = (V, E). To identify clusters within the graph (subgraphs) the graph is clustered using random walks with the cluster_waltrap function (steps = 2) from the iGraph packageå, there is an option for users to remove clusters with less than a specified amount of reads.  These clusters often contain a small number of longer L or R sequences due to the random fragmentation in the rnaCrosslink protocol. 


```{r}
# Cluster the reads
clusteredCds = clusterrnaCrosslink(cds = cds,     # The rnaCrosslinkDataSet instance 
                               cores = 1,         # The number of cores
                               stepCount = 2,     # The number of steps in the random walk
                               clusterCutoff = 2) # The minimum number of reads for a cluster to be considered
```


```{r}
# Check status of instance 
clusteredCds
```
\

### Cluster Numbers


```{r}
# Returns the number of clusters in each sample
clusterNumbers(clusteredCds)
```


### Read numbers in clusters
\

```{r}
# Returns the number reads in clusters
readNumbers( clusteredCds)
```

### Cluster Tables
\

The cluster tables contain coordinates of the clusters in data.frame format. Each cluster has a unique ID and size.x corrasponds to the number of reads assigned to that cluster or supercluster. ls, le, rs and le give the coordinates of the interaction. 

```{r,  eval=FALSE}
getData(clusteredCds,        # The object             
        "clusterTableList",  # The Type of data to return     
        "original")[[1]]     # The stage of the analysis for the return data
```

### Cluster GRanges
\

You can also extract a GRanges object of the individual reads and their cluster membership:

```{r,  eval=FALSE }
getData(clusteredCds,         # The object        
        "clusterGrangesList", # The Type of data to return     
        "original")[[1]]      # The stage of the analysis for the return data
```

## Cluster Trimming 
\

Given the assumption that the reads within each cluster likely originate from the same structure in different molecules these clusters can be trimmed to contain the regions from L and R that have the most evidence the clustering and trimming is achieved with the clusterrnaCrosslink and trimClusters methods. 

```{r, eval = F}
# Trim the Clusters
trimmedClusters = trimClusters(clusteredCds = clusteredCds, # The rnaCrosslinkDataSet instance 
                               trimFactor = 1,              # The cutoff for cluster trimming (see above)
                               clusterCutoff = 0)          # The minimum number of reads for a cluster to be considered
```

```{r, eval = F}
# Check status of instance 
trimmedClusters
```

```{r, eval = F}
# Returns the number of clusters in each sample
clusterNumbers(trimmedClusters)
```

```{r, eval = F}
# Returns the number reads in clusters
readNumbers( trimmedClusters)
```

see [Plotting contact matrices](#plotting-contact-matrices)


## Check Cluster Agreement between replicates
\

```{r}
#plotClusterAgreement(trimmedClusters,
#                     "trimmedClusters")
```


```{r}
#plotClusterAgreementHeat(trimmedClusters,
#                         "trimmedClusters")
```

## Folding 
\

The final step is folding, this step populates the `viennaStructures`, `dgs` and `interactionTable` slots. This step can only be run if you have the Vienna package installed and RNAFold in your PATH. 


```{r,error=FALSE,eval = FALSE, results=FALSE,message=FALSE}
# Fold the RNA in part of whole
foldedCds = foldrnaCrosslink(trimmedClusters,
                         rnaRefs = rnaRefs,
                         start = 1600,
                         end = 1869,
                         shape = 0,
                         ensembl = 20,
                         constraintNumber  = 30,
                         evCutoff = 5)
```


```{r,eval = FALSE,}
# Check status of instance 
foldedCds
```

---

# Other Plots and Functionality


## Plotting contact matrices 
\

Plots can be made for each sample using the *plotMatrices* function. 

```{r}
# Plot heatmaps for each sample
#plotMatrices(cds = cds,         # The rnaCrosslinkDataSet instance 
#             type = "original", # The "analysis stage"
#             directory = 0,     # The directory for output (0 for standard out)
#             a = 1,             # Start coord for x-axis
#             b = rnaSize(cds),  # End coord for x-axis
#             c = 1,             # Start coord for y-axis
#             d = rnaSize(cds),  # End coord for y-axis
#             h = 5)             # The height of the image (if saved)

```

A plot for two chosen samples and analysis stages in the analysis can be made using the *plotCombinedMatrix* function.

```{r}
plotCombinedMatrix(cds,
           type1 = "original",
           type2 = "noHost",
           b = rnaSize(cds),
           d = rnaSize(cds))
```

See which samples and analysis stages are available by checking the status of the object. 

```{r}
clusteredCds
```

A plot for all samples can be made using the *plotMatricesAverage* function. The plot can display up to two chosen analysis stages on separate halves.

```{r}
# Plot heatmaps for all samples combined and all controls combined
plotMatricesAverage(cds = clusteredCds, # The rnaCrosslinkDataSet instance 
             type1 = "originalClusters", # The "analysis stage" to plot on the upper half of the heatmap
             type2 = "noHost",        # The "analysis stage" to plot on the lower half of the heatmap
             directory = 0,     # The directory for output (0 for standard out)
             a = 1,             # Start coord for x-axis
             b = rnaSize(cds),  # End coord for x-axis
             c = 1,             # Start coord for y-axis
             d = rnaSize(cds),  # End coord for y-axis
             h = 5)             # The height of the image (if saved)
```

---


## Identifying domains 
\

With large RNAS (?500bp), it can be useful to segment the RNA and fold the segments seaparately. DNA and RNA that form secondary and tertiary structures often have domains where there is more inter-domain interactions that inra-domain interactions. The *TopDom* package was designed to identify these domains for HI-C data. Using this package you can identify domains in the RNA structural data and can be used to inform the folding. 

```{r, eval = FALSE}
domainDF = data.frame()
for(j in c(20,30,40,50,60,70)){
    #for(i in which(sampleTable(cds)$group == "s")){
    
    timeMats = as.matrix(getData(x = cds, 
                                 data = "matrixList", 
                                 type = "noHost")[[1]])
    
    timeMats = timeMats/ (sum(timeMats)/1000000)
    tmp = tempfile()
    write.table(timeMats, file = tmp,quote = F,row.names = F, col.names = F)
    
    tdData2 = readHiC(
        file = tmp,
        chr = "rna18s",
        binSize = 10,
        debug = getOption("TopDom.debug", FALSE)
    )
    
    tdData =  TopDom(
        tdData2 ,
        window.size = j,
        outFile = NULL,
        statFilter = TRUE,
        debug = getOption("TopDom.debug", FALSE)
    )
    
    td = tdData$domain
    td$sample = sampleTable(cds)$sampleName[1]
    td$window = j
    domainDF = rbind.data.frame(td, domainDF)
    
}


ggplot(domainDF) +
    geom_segment(aes(x = from.coord/10,
                     xend = to.coord/10, y = as.factor(sub("s","",sample)),
                     yend = (as.factor(sub("s","",sample)) ), colour = tag),
                 size  = 20, alpha = 0.8) +
    facet_grid(window~.)+
    theme_bw()

```

![](vignette_domain.jpg)
---


## Analyse folds

\


### Ensemble Stats

### HotSpots of difference 

### PCA of ensembl

A PCA of the structural ensembl can be made. 

```{r,eval = FALSE,}
plotEnsemblePCA(foldedCds, 
                labels = T, # plot labels for structures
                split = T)  # split samples over different facets (T/f)

```

![](vigentte_pca.jpg)


\

### Plot to strcutures on an arc diagram

Compare two structures from the ensembl


```{r,eval = FALSE,}
plotComparisonArc(foldedCds = foldedCds,
                  s1 = "c1",            # The sample of the 1st structure
                  s2 = "s1",            # The sample of the 2nd structure
                  n1 = 13,               # The number of the 1st structure
                  n2 = 16)               # The number of the 2nd structure
```
![](vigentte_arc.jpg)

\


### Plot one structure

```{r eval = F}
plotStructure(foldedCds = foldedCds, 
              rnaRefs = rnaRefs,     
              s = "s1",          # The sample of the structure
              n = 1)             # The number of the structure
```
\


## Inter RNA interactions- Plot an interacting partner 


Along with the RNA of interest the data also contains inter-RNA interactions with other RNAs from the transcriptome reference. After identifying abundant interactions using topInteractions you can find out where on each RNA these inetractions occur using *getInteractions* and *getReverseInteractions*.

```{r,  eval=FALSE}

getInteractions(cds,
                "ENSG00000XXXXXX_NR003287-2_RN28S1_rRNA") %>%
    mutate(sample =sub("\\d$","",sample) )%>%
    group_by(rna,Position,sample)%>%
    summarise(sum =  sum(depth)) %>%
    ggplot()+
    geom_area(aes(x = Position,
                  y = sum, 
                  fill = sample), 
              stat = "identity")+
    facet_grid(sample~.) +
    theme_bw()

```


```{r,  eval=FALSE}
getReverseInteractions(cds,
                       rna) %>%
    mutate(sample =sub("\\d$","",sample) )%>%
    group_by(rna,Position,sample)%>%
    summarise(sum =  sum(depth)) %>%
    ggplot()+
    geom_area(aes(x = Position,
                  y = sum, 
                  fill = sample), 
                    stat = "identity")+
    facet_grid(sample~.)+
    theme_bw()
```

These interactions can be plotted as contact matrices using *plotInteractions*.

```{r}
plotInteractions(cds,
                 rna = "ENSG000000XXXXX_NR003286-2_RN18S1_rRNA",
                 interactor = "ENSG00000XXXXXX_NR003287-2_RN28S1_rRNA",
                 b = "max",
                 d = "max")
```

Plot heatmaps of interactions for all samples combined and all controls combined using *plotInteractionsAverage*

```{r, eval = F}
plotInteractionsAverage(cds,
                 rna = "ENSG000000XXXXX_NR003286-2_RN18S1_rRNA",
                 interactor = "ENSG00000XXXXXX_NR003287-2_RN28S1_rRNA",
                 b = "max",
                 d = "max")
```

\

## Compare to the "Known" structure 


The clusters can be compared to set of interactions to see which clusters share coordinates with a this set of interactions. The table should be formatted as a tabale fame of 2 columns (i and j) each colunn containing numerical values giving an interaction between i and j with which the clusters should be compared.

\

### Make a contact matrix of known / comprative interactions


To compare to set of know interactions you need a contact matrix these interactions, for plotting it is sometimes useful to expand the interactions so they can be seen easily. 

```{r, eval = F}
expansionSize = 5
knownMat = matrix(0, nrow = rnaSize(cds), ncol = rnaSize(cds))
for(i in 1:nrow(known18S)){
    knownMat[ (known18S$V1[i]-expansionSize):(known18S$V1[i]+expansionSize),
              (known18S$V2[i]-expansionSize):(known18S$V2[i]+expansionSize)] =
        knownMat[(known18S$V1[i]-expansionSize):(known18S$V1[i]+expansionSize),
                 (known18S$V2[i]-expansionSize):(known18S$V2[i]+expansionSize)] +1
}
knownMat = knownMat + t(knownMat)

```

\

### Compare the Clusters


Using *compareKnown* you can check which clusters agree with the set of interactions. This functions adds analysis stages "known", "novel" and "knownAndNovel" to the objects data attributes. 

```{r, eval = F}
# use compare known to gett he known and not know clusters
knowClusteredCds = compareKnown(trimmedClusters, # The rnaCrosslinkDataSet instance 
                                knownMat, # A contact matrix of know interactions
                                "trimmedClusters") # The analysis stage of clustering to compare 

knowClusteredCds
```

\

### Plot the overlapping clusters


You can plot these using the *plotMatrices* function

```{r, eval = F}
# Plot heatmaps for all samples combined and all controls combined
plotMatricesAverage(cds = knowClusteredCds, # The rnaCrosslinkDataSet instance 
             type1 = "KnownAndNovel", # The "analysis stage"
             directory = 0,     # The directory for output (0 for standard out)
             a = 1,             # Start coord for x-axis
             b = rnaSize(cds),  # End coord for x-axis
             c = 1,             # Start coord for y-axis
             d = rnaSize(cds),  # End coord for y-axis
             h = 5)             # The hight of the image (if saved)

```

\

### Cluster Numbers


```{r, eval = F}
# Get the number of clusters for each analysis Stage
clusterNumbers(knowClusteredCds)
```

\

### Read numbers in clusters


```{r, eval = F}
# Get the number of reads in each cluster for each analysis stage
readNumbers(knowClusteredCds)
```


\

### Compare structures with a set of contacts 


To compare predicted structures with the know stucture use "compareKnownStructures". This will give you the number of base pairs that agree between the ensembl of predicted structures and the structure imputted for comparison. This can be for better viewing. 


```{r,eval = FALSE}
head(compareKnownStructures(foldedCds, 
                            known18S)) # the comarison set
```

```{r,eval = FALSE}
ggplot(compareKnownStructures(foldedCds, known18S)) +
    geom_hline(yintercept = c(0.5,0.25,0.75,0,1), 
               colour = "grey", 
               alpha = 0.2)+
    geom_vline(xintercept = c(0.5,0.25,0.75,0,1), 
               colour = "grey", 
               alpha = 0.2)+
    geom_point(aes(x = sensitivity, 
                   y = precision, 
                   size = as.numeric(as.character(unlist(foldedCds@dgs))),
                   colour = str_sub(structureID, 
                                    start = 1 , 
                                    end = 2))) +
    xlim(0,1)+
    ylim(0,1)+
    theme_classic()

```


# Quick start
\
The package ships with a subsetted version of the unenriched dataset which can be used to follow the vignette, the full dataset can be found here: [Un-enriched rRNA dataset](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246412). There is also a function `makeExamplernaCrosslinkDataSet()` from the rnaCrosslinkOO package that will create a simple toy example rnaCrosslinkDataSet object, follow this small section to get a feel for the package and it's functionality. The dataset has 1 sample consisting of a control and sample.  

```{r, eval = FALSE}
# make an example dataset
cds = makeExamplernaCrosslinkDataSet()
cds
```

By plotting the reads on a 2D heatmap you can see structured areas of the transcript and long / short distance inter-RNA interactions. The "type" argument can be changed to plot any analysis stage in the matrixList slot which will become populated as the analysis proceeds. 

```{r, eval = FALSE}
# Have a look at the reads for each sample 
plotMatricesAverage(cds = cds,
                    type1 = "original",
                    directory = 0,
                    a = 1,
                    b = rnaSize(cds),
                    c = 1,
                    d = rnaSize(cds),
                    h = 5)

```


Find the RNAs that interact with the RNA of interest. 

```{r, eval = FALSE}
topInteracters(cds,2)
```

Get some more information about the RNAs in the sample and thier abundance in each sample. With more RNAS and transcripts this becomes very useful as a count matrix. 

```{r, eval = FALSE}
featureInfo(cds)
```

Clustering and cluster trimming.

```{r, eval = FALSE}
clusteredCds = clusterrnaCrosslink(cds = cds,
                               cores = 1,
                               stepCount = 2,
                               clusterCutoff = 0)


trimmedClusters = trimClusters(clusteredCds = clusteredCds,
                               trimFactor = 1, 
                               clusterCutoff = 0)
```


Plot them to check the clustering and trimming is working as expected.

```{r, eval = FALSE}
plotMatricesAverage(cds = clusteredCds,
                    type1 = "originalClusters",
                    directory = 0,
                    a = 1,
                    b = rnaSize(cds),
                    c = 1,
                    d = rnaSize(cds),
                    h = 5)
```

```{r, eval = FALSE}
plotMatricesAverage(cds = trimmedClusters,
                    type1 = "trimmedClusters",
                    directory = 0,
                    a = 1,
                    b = rnaSize(cds),
                    c = 1,
                    d = rnaSize(cds),
                    h = 5)

```


Folding:

```{r, eval = FALSE}

fasta = paste(c(rep('A',25),
                rep('T',25),
                rep('A',10),
                rep('T',23)),collapse = "")

header = '>transcript1'


fastaFile = tempfile()
writeLines(paste(header,fasta,sep = "\n"),con = fastaFile)


rnaRefs = list()
rnaRefs[[rnas(cds)]] = read.fasta(fastaFile)
rnaRefs


foldedCds = foldrnaCrosslink(trimmedClusters,
                         rnaRefs = rnaRefs,
                         start = 1,
                         end = 83,
                         shape = 0,
                         ensembl = 5,
                         constraintNumber  = 1,
                         evCutoff = 1)

```

If you have a set of nucloetide contacts,  you can check to see if they exist in the set of predicted structures after folding. Also see `compareKnown` which is a similar function but tests the set of interactions against the clusters.

```{r, eval = FALSE}
# make an example table of "know" interactions
file = data.frame(V1 = c(6),
                  V2 = c(80))

compareKnownStructures(foldedCds,file)

```

Plot a PCA of the predicted structures

```{r, eval = FALSE}

plotEnsemblePCA(foldedCds, labels = T,split = T)
```

Compare two structures

```{r, eval = FALSE}
plotComparisonArc(foldedCds = foldedCds,"s1","c1",2,3)
```

Plot one structure


```{r, eval = FALSE}
plotStructure(foldedCds = foldedCds,"c1",1)


```

---