Bioinformatics

Mostly gleaned from Rosalind. The table of contents should be:

  1. Finding hidden messages in DNA
  2. Genome Sequencing
  3. Comparing Genes, Proteins and Genomes
  4. Deciphering Molecular Evolution
    • Are there fragile regions in the human genome?
  5. Genomic Data Science and Clustering
  6. Finding mutations in DNA and Proteins

Table of Contents

Rapid Introduction to Molecular Biology

  • Cell is the building block of life.
  • The nucleus, a component of eukaryotic cells, was identified as the hub of cellular activity 150 years ago.
  • Contains a stew of macromolecules called chromatin.
  • During mitosis, more chromatin condeneses into long, thin strings called chromosomes.
  • On class of macromolecules contained in chromatin are called nucleic acids - polymers made up of monomers, or nucleotides
  • Forms three parts: a sugar molecule, a negatively charged ion known as a phosphate, and a compound called a nucleobase.
  • Polymerisation is achieved as the sugar of one nucleotide bonds to the phosphate of the next nucleotide in the chain, which forms a sugar-phosphate backbone.
  • Ordering of bases defines a nucleic acid's primary structure

The Second Nucleic Acid

  • Primary structure of nucleic acid described as polymer of nucleotide units
  • 2nd nucleic acid exists alongside DNA - different sugar called ribose instead of phosphate.
  • Also, contains uracil instead of thymine.
  • Similar primary structure between DNA and RNA.
  • DNA goes to mRNA (during RNA transciption)
  • RNA can enter far reaches of cell to carry out DNA's instructions.

The secondary and tertiary structures of DNA

  • Primary structure does not tell us about larger, 3D shape of molecule, required for complete understanding.
  • A always binds to T
  • C always binds to G
  • bonding of two complementary bases

Secondary structure

  • Base pairing bonds of the bases of nucleotides with each other.
  • In RNA, base pairing does not occur at every position.
  • In DNA, every base from one strand bonds with its complement in the opposing strand, makes the molecule take the shape of a long ladder or staircase.

(In general, secondary structure is the general 3D form of local segments of biopolymers such as proteins and nucleic acids)

Tertiary structure

In general, refers to the actual 3D shape that the molecule assumes.

Computing GC content

To determine unknown language quickly, analyse frequency with which each letter appeared in the text. Language tends to exhibit its own letter frequencies, so as long as the text is long enough, software can identify the language.

Although two members of the same species will have different genomes, they still share vast percentage of their DNA, about 99.9^ of the 3.2 billion base pairs in a human genome are common to almost all humans.

C and G appears in equal amounts in a DNA molecule. To analyse the symbol frequencies, can compute the molecule's GC-content.

  • Most eukaryotic genomes hover around 50%
  • Most prokaryotes have different GC-content - higher than 50%.
  • Can be used to differentiate prokaryotes and eukaryotes

GC content corresponds to coding-sequence length. Low GC content in stop codons, so therefore there is likely to be functional significance.

Lower the GC content, more conserved regions.

Genetic code

  • Start codon: AUG

  • End codon: UAA, UAG or UGA

  • Ribosome creates peptides using the tRNA molecule

  • tRNA possess a string of 3 RNA nucleotides on one end (called an anticodon) and an animo acid on the other.

  • Ribosome takes RNA transscribed from DNA (mRNA) - examines it one codon at a time.

  • At each step, tRNA possessing complementary anticodon bonds to the mRNA at the location, and the animo acid found on the opposite end of the tRNA is added to growing peptide chain before remaining part of tRNA is ejected into cell, and ribosome looks for next tRNA molecule.

Assembling genomes

  • Multiple identical copies of a genome is shattered into 'reads'
  • Reads are sequenced, then assembled by using overlapping reads
  • A priori - knowledge which proceeds from theoretical deduction, rather than observation/experience
    • DNA is double stranded - no way to know which strand a given read derives from. Will not know whether to use a read or its reverse complement when assembling a particular strand of a genome.
    • Reads are not perfect
    • Some regions of genome may not be covered by any reads - impossible to reconstruct the entire genome
    • Algorithm to use is the Eulerian Path, for constructing the reads

1. Finding Hidden Messages in DNA

The aim is to find the replication origin. Do this by finding "DNA boxes" that might indicate to DNA polymerase that replication should start there.

Motifs

  • Daily schedules of organisms controlled by internal timekeeper called the circadian clock. This must have a molecular basis
  • Delayed sleep-phase syndrome (DSPS)
  • Early 1970s, Konopka and Benzer traced abnormal circadian patterns to a single gene in fly.
  • Genes encode proteins, which dictate cell function. This is known as gene expression. This must change throughout the day for plants, and us as well (heart attacks more common in the morning, asthma attacks more common in the evening)
  • Sunlight activates LCY, CCA1
  • LCY, CCA1 represses TOC1
  • Production of LCY, CCA1 stops at night time
  • TOC1 promotes LCY, CCA
  • ...just in time for daybreak again
  • LCY, CCA1 aand TOC1 encode regulatory proteins, which can turn other genes on and off by binding to short sequence of DNA, known as a regulatory motif.

We can use a profile matrix to score how likely a given DNA string is likely to be the motif.

Use greedy algorithm to reduce running time of finding a motif. Select the most attractive alternative at every iteration. Use heuristics that trade accuracy for speed in order to find the approximate solution. We can use the greedy motif search to accomplish this.

One other alternative is to use randomised algorithms. This is either:

  • Las Vegas: solutions guaranteed to be exact, but don't know when they'll terminate
  • Monte Carlo: not guaranteed to be exact, but will quickly find approximate solutions

2. Genome Sequencing

The first sequenced genome, belonging to a φX174 bacterial phage (i.e., a virus that preys on bacteria), had only 5,386 nucleotides and was completed in 1977 by Frederick Sanger.

Biologists still lack the technology to read nucleotides of a genome from beginning and end in the same way a book would be read.

Can only sequence shorter genomes known as reads.

Instead of trying to put the reads together greedily, make a graph instead.

A Hamiltonian path is one path in a graph that visits each node exactly once. Best algorithm is polynomial

An Eulerian trail (or path) is a trail in the graph that visits each edge exactly once.

A Eulerian cycle can be found this in linear time (with the number of nodes) using Hierholzer's algorithm.

An Eulerian circuit or path is one is an Eulerian trail which starts and ends on the same vertex.

Due to the limits of generating good read lengths in order to construct paths, use read pairs - pairs of reads separated by a fixed distance d in the genome.

How Do We Sequence Anitbiotics?

Goal is to find how Bacillus brevis, an antibiotic producing bacteria, could have produced Tyrocidine B1

The different ways of dividing up a DNA string into codons are called reading frames. DNA is double stranded, so that gives 6 reading frames.

Tyrocidine B1 is actually a cyclic peptide; ten different linear representations. However, cannot find any of them in the Bacillus Brevis.

Turns out that Tyrocidine B1 is a non-ribosomal peptide - not synthesised by a ribosome, but by giant protein called NRP synthetase.

Mass spctrometers produce an experimental spectrum from an unknown peptide.

3. Comparing Genes, Proteins and Genomes

https://stepic.org/lesson/Cracking-the-Non-Ribosomal-Code-263/step/3?course=Bioinformatics-Algorithms&unit=467

  • Non-ribosomal code - bacteria and fungi produce antibiotics and non-ribosome

  • Sequence alignment in order to find the longest common subsequence

    • Match (1 point)
    • Insertion (0 points)
    • Deletion (0 points)
    • Mismatch (0 points)
  • Dynamic programming in an arbitrary DAG

    • Topologically sort

Scoring matrices

To Generalise the alignment scoring model, award +1 for matches, but also penalise mismatches by some positive constant µ. Can also use the Blosum62 or PAM 250 matrices for scoring

From here, we can see that the insertions and deletions that are being referred to are with respect to the string down the rows (i.e. string1)

Affine gap penalties

  • If σ is the indel penalty, this is only given for the singular gaps and doesn't take into account contiguous sequences of spaces in a row of alignment
  • Instead, keep σ as the gap opening penalty, ε as the gap extension penalty, and have the affine penalty as σ + ε (k - 1) where k is the number of gaps
  • Naive approach adds a 'long' edge for each of the insertions and deletions:https://stepic.org/lesson/Penalizing-Insertions-and-Deletions-in-Sequence-Alignment-249/step/4 - number of edges increases from O(n²) to O(n³).

Locating Disease Causing Mutations

  • Reads are a few hundred base pairs long - each read forms a collections of strings, Patterns that we wish to match against a genome Text
  • Can use suffix tries, or trees, to find the common substring of multiple strings

5. Genomic Data Science and Clustering

How Did Yeast Become a Wine Maker?

  • 6000 year old winery found in Armenian cave
  • Saccharomyces (cerevisiae) is the genus of yeast used in alcohol and bread production
  • Pen-tailed tree shrews are also alcoholics

Gene expression matrix: n * m, n yeast genes and m points in time for expression levels.

Fragile Regions of the Genome

Gene compression

Run-length encoding is a simple compression technique; replace repeated occurences with single nucleotide and the repitition count:

TTTTTGGGAAAACCCCCCA -> 5T3G4A6C1A

Genomes do not have many runs, but have repeats instead. Convert repeats into runs, then apply run-length encoding

Use the Burrows-Wheeler transform:

  • form all cyclic rotations of Text
  • order them lexicographically into a len(text) * len(text) sized matrix
  • each column is some rearrangement of symbols of Text

Evolutionary Tree Reconstruction

  • If perfect distance matrix, can additive phylogeny construction
  • If non-additive, can use ultrametric or neighbor joining method
  • To reconstruct ancestor, given the genomes of descendants, can use small parsimony to reconstruct ancestor geneome, character by character
  • Given a set of strings, but no tree, can construct a tree with the minimum parsimony score using large parsimony's greedy algorithm

Brewing Yeast

  • Converts glucose to alcohol for energy
  • If out of glucose, will convert alcohol and oxygen to glucose for energy
  • Therefore, keep wine sealed from oxygen to prevent this metabolic inversion, known as a diauxic shift
  • WGDs, whole genome duplications, may explain how a species mutated to gain this diauxic shift feature
  • Yeast converts acetaldehyde into ethanol, catalysed by Adh (alcohol dehydrogenase). Encoded by two genes, Ahd₁ (good at producing ethanol) and Ahd₂ (good at consuming ethanol)
  • In 1997 Joseph DeRisi conducts first gene expression experiment by sampling Saccharomyces culture 7 times. 6400 genes in genome results in a 6400*7 expression matrix
  • Goal is to partition genes into k clusters
  • Good clustering principle: Every pair of points from the same cluster should be closer to each other than any pair of points from different clusters.
  • k-means Clustering maks a hard assignment of each point to only one cluster. Makes little sense for midpoints, or points that are approximately equidistant from two centers - use soft clustering instead
  • Learn about Binomial distributions

Hidden Markov models

  • What HIV p
  • 1984: Flight attendent: AIDS "Patient Zero"
  • How does HIV evade the immune system?
  • 2% per nucleotide per year
  • Syncytium-inducing phenotype: trick human cells into fusing with neighboring cells.

4. Deciphering Molecular Evolution

Are There Fragile Regions in the Human Genome?

  • A reversal flips around an interval of a chromosome; inverts directions of any synteny blocks in that interval

  • Tiny fraction of genome rearrangements may have a positive effect

  • Random breakage model: breakage points of mammalian genomes do not exist

    • If true, should result in exponential distribution of synteny block lengths
    • Humans fit this distribution
  • Greedy sorting can help sort the blocks into the identity permutation, but proves to be a poor approximation for the reversal distance

  • Motivation to study multichromosomal rearrangements as well; rare in species evolution, but cancer cells (chronic myeloid leukemia (CML)) exhibit this behaviour a lot.

  • Translocation: two intervals of DNA are excised from the end of chromosomes, and reattached on the opposite chromosomes - occurs in cancer

  • Fusion: merging two chromosomes into one chromosome

    • 5 million years ago (after human/chimpanzee lineage split), fusion caused two human chromosomes to reduce count from 24 to 23 in one of our ancestors - wow!
  • Fission: breaking apart one chromosome into two chromosome

The breakpoint graph of trivial cycles is known as the trivial breakpoint graph.

Number of synteny blocks in P and Q: Blocks(P,Q)

Coolest thing I read today: "Five million years ago, shortly after the human and chimpanzee lineages split, a fusion of two chromosomes (called 2A and 2B) in one of our ancestors created human chromosome 2 and reduced our chromosome count from 24 to 23."

2-Break Distance Theorem: The 2-break distance between genomes P and Q is equal to Blocks(P,Q) - Cycles(P,Q)

6. Finding Mutations in DNA and Proteins

Motivation for using HMM:

Limitations of conventional sequence alignments - they cannot detect similarity between highly diverged sequences.

HIV viruses in single patient evolve at 2% per nucleotide per year; so diverged that they require different drug cocktails.

One particular HIV phenotype (syncytium-inducing) - causes human cells to fuse together to kill them better. Amino acids at 11 and 25 in the virus are A and K, but alignment algorithms won't be able to pick this up.

Very important to align sequence correctly, so was it a good idea to use the same scoring matrix across different columns of an alignment?

Yakuza Analogy

Read the dealer's mind one window at a time, e.g. using sliding window approach, but

  • Methylation adds CH3 group to nucleotide, but CG islands occur

Hidden Markov Models

  • Can be used to classify proteins into families
  • Weak pairwise similarities

Appendix

Hamming distance

Given two strings of equal length s and t,the Hamming distance d_H(s,t) is the number of corresponding symbols that differ in s and t.

Rabbits

Complexity analyses

While loop of fibonnaci takes only O(n) time:

def fib(n, k = 1):
    a,b = 1,0
    for i in range(n):
        a,b = a+k*b, a
    return b

Recursive implementation takes O(n) time.

Glossary

  • A-domain: 500 amino acid-long sequence that is responsible for adding a single amino accid to the B1 antibiotic
  • Central dogma: of Molecular Biology. DNA makes RNA makes protein
  • DnaA: A protein that binds to a short segment within the oriC, known as the DnaA box
  • Eukaryotic: One of the three domains of life, other two being bacteria and archaea. Other two are unicellular organisms, eukaryotes account for all multicellular life on Earth. Contains complex organelles encased in membranes, i.e. the cellular nucleus. These do division of labour, which means cell can carry out relatively complex tasks.
  • FASTA: DNA and protein sequence alignment software package; FASTA format is now ubiquitous in bioinformatics
  • Homolog: Chromosomes in a polyploid organism that have very similar DNA
  • GC content: percentage of nitrogenous bases in DNA. For humans, it is 42%
  • Genus: taxonomic rank, between family and species
  • Non-ribosomal code: Code that produces (non-ribosomal) peptides without any reliance on the ribosome and the genetic code.
  • NRP syntethase: Enyzme that pieces together antibiotic peptides without reliance on RNA or genetic code. Do not adhere to the Central Dogma
  • Organelle: Structure of cell that serves as central hub for cellular functions.
  • oriC: Replication origin; where DNA polymerases (molecular copy machines) begin creating the DNA complement of the genome
  • Polymerisation: The joining of monomers (molecules) into a polymer chain
  • Prokaryotic: A bacterial organism. Lack organelles. Reproduction process, binary fission, is much simpler than mitosis
  • Regulatory motif: Also known as a transcription factor binding site, this is a specific short DNA interval found in a gene's upstream region
  • Ribosome: large, complex molecular machine which serves as the primary site of biological protein synthesis
  • Rooted tree: Tree in which a special node (the root, or 'eve') is singled out. Usually the one with no parent
  • Saccharomyces: genus in the kingdom of fungi, that includes many species of yeast
  • Single nucleotide polymorphism (SNP): A read that indicates a mutation of one nucleotide to another
  • Subsequence: sequence of symbols appearing in the same order (but not necessarily consecutively)
  • Suboptimal alignment: Dynamic programming technique used to all all occurences of a certain repeat
  • Synteny block: procession of similar genes of two species genomes, with a direction
  • Transposon: Short intervals of DNA that can change their positions (jumping genes)
  • Transcription factor: Master regulatory proteins that can turn other genes on and off. Binds to the regulatory motif
  • Upstream region: a 600-1000 nucleotide-long region preceding the start of any gene
  • Unichromosomal: Pertaining to a single line of genome
  • Unrooted binary tree: An unrooted tree with each node having either one or three nodes
  • Viral vector: Genetically engineered mini-genomes, capable of penetrating cell walls