Bioinformatics

Mostly gleaned from Rosalind. The table of contents should be:

Finding hidden messages in DNA
Genome Sequencing
Comparing Genes, Proteins and Genomes
Deciphering Molecular Evolution
- Are there fragile regions in the human genome?
Genomic Data Science and Clustering
Finding mutations in DNA and Proteins

Rapid Introduction to Molecular Biology
1. Finding Hidden Messages in DNA
1. Motifs
2. Genome Sequencing
1. How Do We Sequence Anitbiotics?
3. Comparing Genes, Proteins and Genomes
5. Genomic Data Science and Clustering
4. Deciphering Molecular Evolution
1. Are There Fragile Regions in the Human Genome?
6. Finding Mutations in DNA and Proteins
1. Yakuza Analogy
2. Hidden Markov Models
Appendix
1. Hamming distance
2. Rabbits
Glossary

Rapid Introduction to Molecular Biology

Cell is the building block of life.
The nucleus, a component of eukaryotic cells, was identified as the hub of cellular activity 150 years ago.
Contains a stew of macromolecules called chromatin.
During mitosis, more chromatin condeneses into long, thin strings called chromosomes.
On class of macromolecules contained in chromatin are called nucleic acids - polymers made up of monomers, or nucleotides
Forms three parts: a sugar molecule, a negatively charged ion known as a phosphate, and a compound called a nucleobase.
Polymerisation is achieved as the sugar of one nucleotide bonds to the phosphate of the next nucleotide in the chain, which forms a sugar-phosphate backbone.
Ordering of bases defines a nucleic acid's primary structure

The Second Nucleic Acid

Primary structure of nucleic acid described as polymer of nucleotide units
2nd nucleic acid exists alongside DNA - different sugar called ribose instead of phosphate.
Also, contains uracil instead of thymine.
Similar primary structure between DNA and RNA.
DNA goes to mRNA (during RNA transciption)
RNA can enter far reaches of cell to carry out DNA's instructions.

The secondary and tertiary structures of DNA

Primary structure does not tell us about larger, 3D shape of molecule, required for complete understanding.
A always binds to T
C always binds to G
bonding of two complementary bases

Secondary structure

Base pairing bonds of the bases of nucleotides with each other.
In RNA, base pairing does not occur at every position.
In DNA, every base from one strand bonds with its complement in the opposing strand, makes the molecule take the shape of a long ladder or staircase.

(In general, secondary structure is the general 3D form of local segments of biopolymers such as proteins and nucleic acids)

Tertiary structure

In general, refers to the actual 3D shape that the molecule assumes.

Computing GC content

To determine unknown language quickly, analyse frequency with which each letter appeared in the text. Language tends to exhibit its own letter frequencies, so as long as the text is long enough, software can identify the language.

Although two members of the same species will have different genomes, they still share vast percentage of their DNA, about 99.9^ of the 3.2 billion base pairs in a human genome are common to almost all humans.

C and G appears in equal amounts in a DNA molecule. To analyse the symbol frequencies, can compute the molecule's GC-content.

Most eukaryotic genomes hover around 50%
Most prokaryotes have different GC-content - higher than 50%.
Can be used to differentiate prokaryotes and eukaryotes

GC content corresponds to coding-sequence length. Low GC content in stop codons, so therefore there is likely to be functional significance.

Lower the GC content, more conserved regions.

Genetic code

Start codon: AUG
End codon: UAA, UAG or UGA
Ribosome creates peptides using the tRNA molecule
tRNA possess a string of 3 RNA nucleotides on one end (called an anticodon) and an animo acid on the other.
Ribosome takes RNA transscribed from DNA (mRNA) - examines it one codon at a time.
At each step, tRNA possessing complementary anticodon bonds to the mRNA at the location, and the animo acid found on the opposite end of the tRNA is added to growing peptide chain before remaining part of tRNA is ejected into cell, and ribosome looks for next tRNA molecule.

Assembling genomes

Multiple identical copies of a genome is shattered into 'reads'
Reads are sequenced, then assembled by using overlapping reads
A priori - knowledge which proceeds from theoretical deduction, rather than observation/experience
- DNA is double stranded - no way to know which strand a given read derives from. Will not know whether to use a read or its reverse complement when assembling a particular strand of a genome.
- Reads are not perfect
- Some regions of genome may not be covered by any reads - impossible to reconstruct the entire genome
- Algorithm to use is the Eulerian Path, for constructing the reads

1. Finding Hidden Messages in DNA

The aim is to find the replication origin. Do this by finding "DNA boxes" that might indicate to DNA polymerase that replication should start there.

Motifs

Daily schedules of organisms controlled by internal timekeeper called the circadian clock. This must have a molecular basis
Delayed sleep-phase syndrome (DSPS)
Early 1970s, Konopka and Benzer traced abnormal circadian patterns to a single gene in fly.
Genes encode proteins, which dictate cell function. This is known as gene expression. This must change throughout the day for plants, and us as well (heart attacks more common in the morning, asthma attacks more common in the evening)
Sunlight activates LCY, CCA1
LCY, CCA1 represses TOC1
Production of LCY, CCA1 stops at night time
TOC1 promotes LCY, CCA
...just in time for daybreak again
LCY, CCA1 aand TOC1 encode regulatory proteins, which can turn other genes on and off by binding to short sequence of DNA, known as a regulatory motif.

We can use a profile matrix to score how likely a given DNA string is likely to be the motif.

Use greedy algorithm to reduce running time of finding a motif. Select the most attractive alternative at every iteration. Use heuristics that trade accuracy for speed in order to find the approximate solution. We can use the greedy motif search to accomplish this.

One other alternative is to use randomised algorithms. This is either:

Las Vegas: solutions guaranteed to be exact, but don't know when they'll terminate
Monte Carlo: not guaranteed to be exact, but will quickly find approximate solutions

2. Genome Sequencing

The first sequenced genome, belonging to a φX174 bacterial phage (i.e., a virus that preys on bacteria), had only 5,386 nucleotides and was completed in 1977 by Frederick Sanger.

Biologists still lack the technology to read nucleotides of a genome from beginning and end in the same way a book would be read.

Can only sequence shorter genomes known as reads.

Instead of trying to put the reads together greedily, make a graph instead.

A Hamiltonian path is one path in a graph that visits each node exactly once. Best algorithm is polynomial

An Eulerian trail (or path) is a trail in the graph that visits each edge exactly once.

A Eulerian cycle can be found this in linear time (with the number of nodes) using Hierholzer's algorithm.

An Eulerian circuit or path is one is an Eulerian trail which starts and ends on the same vertex.

Due to the limits of generating good read lengths in order to construct paths, use read pairs - pairs of reads separated by a fixed distance d in the genome.

How Do We Sequence Anitbiotics?

Goal is to find how Bacillus brevis, an antibiotic producing bacteria, could have produced Tyrocidine B1

The different ways of dividing up a DNA string into codons are called reading frames. DNA is double stranded, so that gives 6 reading frames.

Tyrocidine B1 is actually a cyclic peptide; ten different linear representations. However, cannot find any of them in the Bacillus Brevis.

Turns out that Tyrocidine B1 is a non-ribosomal peptide - not synthesised by a ribosome, but by giant protein called NRP synthetase.

Mass spctrometers produce an experimental spectrum from an unknown peptide.

3. Comparing Genes, Proteins and Genomes

https://stepic.org/lesson/Cracking-the-Non-Ribosomal-Code-263/step/3?course=Bioinformatics-Algorithms&unit=467

Non-ribosomal code - bacteria and fungi produce antibiotics and non-ribosome
Sequence alignment in order to find the longest common subsequence
- Match (1 point)
- Insertion (0 points)
- Deletion (0 points)
- Mismatch (0 points)
Dynamic programming in an arbitrary DAG
- Topologically sort

Scoring matrices

To Generalise the alignment scoring model, award +1 for matches, but also penalise mismatches by some positive constant µ. Can also use the Blosum62 or PAM 250 matrices for scoring

From here, we can see that the insertions and deletions that are being referred to are with respect to the string down the rows (i.e. string1)

Affine gap penalties

If σ is the indel penalty, this is only given for the singular gaps and doesn't take into account contiguous sequences of spaces in a row of alignment
Instead, keep σ as the gap opening penalty, ε as the gap extension penalty, and have the affine penalty as σ + ε (k - 1) where k is the number of gaps
Naive approach adds a 'long' edge for each of the insertions and deletions:https://stepic.org/lesson/Penalizing-Insertions-and-Deletions-in-Sequence-Alignment-249/step/4 - number of edges increases from O(n²) to O(n³).

Locating Disease Causing Mutations

Reads are a few hundred base pairs long - each read forms a collections of strings, Patterns that we wish to match against a genome Text
Can use suffix tries, or trees, to find the common substring of multiple strings

5. Genomic Data Science and Clustering

How Did Yeast Become a Wine Maker?

6000 year old winery found in Armenian cave
Saccharomyces (cerevisiae) is the genus of yeast used in alcohol and bread production
Pen-tailed tree shrews are also alcoholics

Gene expression matrix: n * m, n yeast genes and m points in time for expression levels.

Fragile Regions of the Genome

Gene compression

Run-length encoding is a simple compression technique; replace repeated occurences with single nucleotide and the repitition count:

TTTTTGGGAAAACCCCCCA -> 5T3G4A6C1A

Genomes do not have many runs, but have repeats instead. Convert repeats into runs, then apply run-length encoding

Use the Burrows-Wheeler transform:

form all cyclic rotations of Text
order them lexicographically into a len(text) * len(text) sized matrix
each column is some rearrangement of symbols of Text

Evolutionary Tree Reconstruction

If perfect distance matrix, can additive phylogeny construction
If non-additive, can use ultrametric or neighbor joining method
To reconstruct ancestor, given the genomes of descendants, can use small parsimony to reconstruct ancestor geneome, character by character
Given a set of strings, but no tree, can construct a tree with the minimum parsimony score using large parsimony's greedy algorithm

Brewing Yeast

Converts glucose to alcohol for energy
If out of glucose, will convert alcohol and oxygen to glucose for energy
Therefore, keep wine sealed from oxygen to prevent this metabolic inversion, known as a diauxic shift
WGDs, whole genome duplications, may explain how a species mutated to gain this diauxic shift feature
Yeast converts acetaldehyde into ethanol, catalysed by Adh (alcohol dehydrogenase). Encoded by two genes, Ahd₁ (good at producing ethanol) and Ahd₂ (good at consuming ethanol)
In 1997 Joseph DeRisi conducts first gene expression experiment by sampling Saccharomyces culture 7 times. 6400 genes in genome results in a 6400*7 expression matrix
Goal is to partition genes into k clusters
Good clustering principle: Every pair of points from the same cluster should be closer to each other than any pair of points from different clusters.
k-means Clustering maks a hard assignment of each point to only one cluster. Makes little sense for midpoints, or points that are approximately equidistant from two centers - use soft clustering instead
Learn about Binomial distributions

Hidden Markov models

What HIV p
1984: Flight attendent: AIDS "Patient Zero"
How does HIV evade the immune system?
2% per nucleotide per year
Syncytium-inducing phenotype: trick human cells into fusing with neighboring cells.

4. Deciphering Molecular Evolution

Are There Fragile Regions in the Human Genome?

A reversal flips around an interval of a chromosome; inverts directions of any synteny blocks in that interval
Tiny fraction of genome rearrangements may have a positive effect
Random breakage model: breakage points of mammalian genomes do not exist
- If true, should result in exponential distribution of synteny block lengths
- Humans fit this distribution
Greedy sorting can help sort the blocks into the identity permutation, but proves to be a poor approximation for the reversal distance
Motivation to study multichromosomal rearrangements as well; rare in species evolution, but cancer cells (chronic myeloid leukemia (CML)) exhibit this behaviour a lot.
Translocation: two intervals of DNA are excised from the end of chromosomes, and reattached on the opposite chromosomes - occurs in cancer
Fusion: merging two chromosomes into one chromosome
- 5 million years ago (after human/chimpanzee lineage split), fusion caused two human chromosomes to reduce count from 24 to 23 in one of our ancestors - wow!
Fission: breaking apart one chromosome into two chromosome

The breakpoint graph of trivial cycles is known as the trivial breakpoint graph.

Number of synteny blocks in P and Q: Blocks(P,Q)

Coolest thing I read today: "Five million years ago, shortly after the human and chimpanzee lineages split, a fusion of two chromosomes (called 2A and 2B) in one of our ancestors created human chromosome 2 and reduced our chromosome count from 24 to 23."

2-Break Distance Theorem: The 2-break distance between genomes P and Q is equal to Blocks(P,Q) - Cycles(P,Q)

6. Finding Mutations in DNA and Proteins

Motivation for using HMM:

Limitations of conventional sequence alignments - they cannot detect similarity between highly diverged sequences.

HIV viruses in single patient evolve at 2% per nucleotide per year; so diverged that they require different drug cocktails.

One particular HIV phenotype (syncytium-inducing) - causes human cells to fuse together to kill them better. Amino acids at 11 and 25 in the virus are A and K, but alignment algorithms won't be able to pick this up.

Very important to align sequence correctly, so was it a good idea to use the same scoring matrix across different columns of an alignment?

Yakuza Analogy

Read the dealer's mind one window at a time, e.g. using sliding window approach, but

Methylation adds CH3 group to nucleotide, but CG islands occur

Hidden Markov Models

Can be used to classify proteins into families
Weak pairwise similarities

Appendix

Hamming distance

Given two strings of equal length s and t,the Hamming distance d_H(s,t) is the number of corresponding symbols that differ in s and t.

Rabbits

Complexity analyses

While loop of fibonnaci takes only O(n) time:

def fib(n, k = 1):
    a,b = 1,0
    for i in range(n):
        a,b = a+k*b, a
    return b

Recursive implementation takes O(n) time.

Glossary

A-domain: 500 amino acid-long sequence that is responsible for adding a single amino accid to the B1 antibiotic
Central dogma: of Molecular Biology. DNA makes RNA makes protein
DnaA: A protein that binds to a short segment within the oriC, known as the DnaA box
Eukaryotic: One of the three domains of life, other two being bacteria and archaea. Other two are unicellular organisms, eukaryotes account for all multicellular life on Earth. Contains complex organelles encased in membranes, i.e. the cellular nucleus. These do division of labour, which means cell can carry out relatively complex tasks.
FASTA: DNA and protein sequence alignment software package; FASTA format is now ubiquitous in bioinformatics
Homolog: Chromosomes in a polyploid organism that have very similar DNA
GC content: percentage of nitrogenous bases in DNA. For humans, it is 42%
Genus: taxonomic rank, between family and species
Non-ribosomal code: Code that produces (non-ribosomal) peptides without any reliance on the ribosome and the genetic code.
NRP syntethase: Enyzme that pieces together antibiotic peptides without reliance on RNA or genetic code. Do not adhere to the Central Dogma
Organelle: Structure of cell that serves as central hub for cellular functions.
oriC: Replication origin; where DNA polymerases (molecular copy machines) begin creating the DNA complement of the genome
Polymerisation: The joining of monomers (molecules) into a polymer chain
Prokaryotic: A bacterial organism. Lack organelles. Reproduction process, binary fission, is much simpler than mitosis
Regulatory motif: Also known as a transcription factor binding site, this is a specific short DNA interval found in a gene's upstream region
Ribosome: large, complex molecular machine which serves as the primary site of biological protein synthesis
Rooted tree: Tree in which a special node (the root, or 'eve') is singled out. Usually the one with no parent
Saccharomyces: genus in the kingdom of fungi, that includes many species of yeast
Single nucleotide polymorphism (SNP): A read that indicates a mutation of one nucleotide to another
Subsequence: sequence of symbols appearing in the same order (but not necessarily consecutively)
Suboptimal alignment: Dynamic programming technique used to all all occurences of a certain repeat
Synteny block: procession of similar genes of two species genomes, with a direction
Transposon: Short intervals of DNA that can change their positions (jumping genes)
Transcription factor: Master regulatory proteins that can turn other genes on and off. Binds to the regulatory motif
Upstream region: a 600-1000 nucleotide-long region preceding the start of any gene
Unichromosomal: Pertaining to a single line of genome
Unrooted binary tree: An unrooted tree with each node having either one or three nodes
Viral vector: Genetically engineered mini-genomes, capable of penetrating cell walls

This content has been automatically generated. For feedback or errata, please email eddie at xiaodi.li.