Friday, August 5, 2016

Notes, Week 5

Scientific literature is being published at a rate that we can’t keep up with.
Text mining extracts information from published sources. Tools like these are becoming essential to keep up with the amount of information.

The growth of this literature is making it increasingly difficult for researchers to stay up to date, and it makes it harder to come up with a meaningful and testable hypothesis.

Text mining is the process of extracting information from text through techniques such as information retrieval, machine learning, natural language processing, statistics, and computational linguistics.

This reduces the time needed to read through the information in the literature.

Most research in named entity recognition (NER) focuses on recognising gene and protein mentions. Some work has also been done on identifying cell lines.

Dictionary-based methods work by matching text against a fixed dictionary of entity names. They depend on the coverage of the dictionary and the performance of the matching techniques.
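
A minimal Python sketch of dictionary-based matching. The tiny dictionary and the function name `find_entities` are hypothetical and only illustrate the idea; a real system would use a much larger dictionary and a more forgiving matching technique.

```python
import re

# Hypothetical toy dictionary: entity name -> entity type.
ENTITY_DICTIONARY = {
    "BRCA1": "gene",
    "p53": "protein",
    "HeLa": "cell line",
}

def find_entities(text, dictionary=ENTITY_DICTIONARY):
    """Return (start, end, name, type) for every dictionary name found in text."""
    matches = []
    for name, entity_type in dictionary.items():
        # \b stops "p53" from matching inside a longer token such as "p530".
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            matches.append((m.start(), m.end(), name, entity_type))
    return sorted(matches)

sentence = "Loss of BRCA1 and p53 was observed in HeLa cells, but not TP53."
for start, end, name, entity_type in find_entities(sentence):
    print(start, end, name, entity_type)
```

"TP53" is not reported because it is not in the toy dictionary, which shows how the result depends on dictionary coverage.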

Rule-based methods use orthographic and morpho-syntactic features to generate patterns and rules. They are very precise but have poor recall.
Machine learning methods have the highest performance levels.

NER is viewed as either a classification or a sequence labelling problem. Classification approaches normally treat NER as assigning a class to each token. They support features such as surface clues and morpho-syntactic features, but they only support binary classification. Sequence labelling approaches deduce the most probable sequence of tags.
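
One common way to find the most probable sequence of tags is the Viterbi algorithm over a hidden Markov model. The notes do not say which algorithm the paper has in mind, so this is only an illustrative sketch, and every probability below is made up.

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable tag sequence for tokens under a toy HMM."""
    # best[i][tag] = (probability of the best path ending in tag at token i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(tokens[0], unseen), None) for t in tags}]
    for token in tokens[1:]:
        column = {}
        for t in tags:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(token, unseen), p)
                for p in tags
            )
            column[t] = (prob, prev)
        best.append(column)
    # Trace back from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical parameters: "GENE" marks a gene mention, "O" marks any other token.
TAGS = ["O", "GENE"]
START_P = {"O": 0.8, "GENE": 0.2}
TRANS_P = {"O": {"O": 0.8, "GENE": 0.2}, "GENE": {"O": 0.7, "GENE": 0.3}}
EMIT_P = {
    "O": {"the": 0.4, "binds": 0.3, "to": 0.2},
    "GENE": {"BRCA1": 0.5, "p53": 0.4},
}

print(viterbi(["BRCA1", "binds", "p53"], TAGS, START_P, TRANS_P, EMIT_P))
```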

Labeling has become complicated because a dictionary-based system is limited.

The standard method of normalisation is to compare an NE against a dictionary of synonyms and identifiers. 

Rule-based approaches have been used to normalize terms by applying a set of transformations to try to match a term in the lexicon.
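
A small sketch of normalisation by transformation and dictionary lookup. The synonym table, the identifiers (`GENE:0001` and so on are placeholders, not real database IDs), and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical synonym dictionary: surface form -> placeholder identifier.
SYNONYMS = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "brca1": "GENE:0002",
    "il2": "GENE:0003",
}

def transformations(term):
    """Yield progressively more aggressive rewrites of a term."""
    yield term
    lowered = term.lower()
    yield lowered
    # Drop hyphens and spaces, e.g. "IL-2" -> "il2".
    yield re.sub(r"[-\s]+", "", lowered)

def normalise(term, synonyms=SYNONYMS):
    """Return the identifier of the first transformed form found, else None."""
    for candidate in transformations(term):
        if candidate in synonyms:
            return synonyms[candidate]
    return None

for mention in ["P53", "IL-2", "unknown gene"]:
    print(mention, "->", normalise(mention))
```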

Swanson's ABC model: New hypotheses can emerge and scientific discoveries can be anticipated or stimulated by the investigation of complementary but disjoint literature.
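
A rough sketch of the ABC idea, assuming the literature has already been reduced to pairs of terms that appear together. If A co-occurs with B and B co-occurs with C, but A and C never co-occur, then A to C is a candidate hypothesis. The term names and the function `abc_hypotheses` are hypothetical.

```python
from collections import defaultdict

def abc_hypotheses(cooccurrences, a_term):
    """Suggest C terms linked to a_term only through shared B terms."""
    neighbours = defaultdict(set)
    for x, y in cooccurrences:
        neighbours[x].add(y)
        neighbours[y].add(x)
    hypotheses = defaultdict(set)
    for b in neighbours[a_term]:
        for c in neighbours[b]:
            # Keep C only if it never co-occurs directly with A.
            if c != a_term and c not in neighbours[a_term]:
                hypotheses[c].add(b)
    return {c: sorted(bs) for c, bs in hypotheses.items()}

# Toy co-occurrence pairs standing in for two disjoint sets of papers.
pairs = [("A", "B1"), ("A", "B2"), ("B1", "C"), ("B2", "C"), ("A", "D"), ("B1", "D")]
print(abc_hypotheses(pairs, "A"))
```

Here "D" is not suggested because it already co-occurs with "A", so it would not be a new connection.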

Sunday, July 31, 2016

Summer Research Week 5 (08/01 - 08/05)

Machine Learning & Biology

  • Read the following paper and take notes: What the papers say: Text mining for genomics and systems biology, Nathan Harmston, Wendy Filsell and Michael PH Stumpf, Human Genomics, 2010. Applying text mining to biological research in genomics and systems biology is one of the areas where machine learning can potentially have a major impact.
  • Read through a paper (A framework for the development of Biomedical Text Mining software tools) describing the basic framework of a text mining system for bioinformatics. You don't need to understand the details of every system/tool mentioned in the paper. However, try to capture the main idea about the whole framework and where and how these systems and tools can be put together to help people study biology. 

Tuesday, July 26, 2016

Notes II

Chapter 6
An allele is one of two or more alternative forms of a gene that arise by mutation and are found at the same place on a chromosome.

An organism is homozygous dominant if it carries two copies of the same dominant allele, and homozygous recessive if it carries two copies of the same recessive allele.

Heterozygous means that an organism has two different alleles of a gene. 

In genetic analysis, the genes that interact to produce a specific phenotype are identified by hunting for all the different kinds of mutations that affect that phenotype.

Functional Genomics provides powerful ways of defining the set of genes that participate in any defined system.

Proteomics assays protein interaction.

The transcription of one gene may be turned on and off by other genes, called regulatory genes.

Under the one-gene-one-enzyme hypothesis, genes are responsible for the function of enzymes and each gene controls one specific enzyme.

All proteins, whether or not they are enzymes, are encoded by genes, so the phrase became one-gene-one-protein or one-gene-one-polypeptide.

The type of dominance inferred depends on the phenotypic level at which the assay is made: organismal, cellular, or molecular.

Codominance is the relation of two alleles that are both fully expressed. 

A double mutant is an organism that carries homozygous mutations in two different genes.

Epistasis is the phenomenon in which the effect of one gene depends on one or more modifier genes.

Haploinsufficiency is where a diploid organism has only one functional copy of a gene and that copy does not produce enough gene product to bring about the wild-type condition.

Heterokaryon is a multinucleate cell that contains genetically different nuclei. 

Penetrance is the proportion of organisms carrying a gene or set of genes that actually express it in their phenotype.

A pleiotropic effect occurs when one gene influences two or more phenotypic traits.

Chapter 7
DNA is composed of phosphate, deoxyribose, and four nitrogenous bases.

Two of the bases, adenine and guanine, have a double-ring structure characteristic of a chemical called a purine.
The other two bases, cytosine and thymine, have a single-ring structure called a pyrimidine.
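
A tiny sketch that sorts the bases of a DNA sequence into the two groups described above. The function names are made up for illustration.

```python
PURINES = {"A", "G"}      # adenine, guanine: double-ring structure
PYRIMIDINES = {"C", "T"}  # cytosine, thymine: single-ring structure

def base_class(base):
    """Return 'purine' or 'pyrimidine' for a single DNA base letter."""
    base = base.upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"not a DNA base: {base!r}")

def count_classes(sequence):
    """Count how many purines and pyrimidines a DNA sequence contains."""
    counts = {"purine": 0, "pyrimidine": 0}
    for base in sequence:
        counts[base_class(base)] += 1
    return counts

print(count_classes("GATTACA"))
```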

Nucleotides are composed of a phosphate group, a deoxyribose sugar molecule, and one of the four bases.

Antiparallel means that the two strands of DNA run in opposite orientations.
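
Because the strands are antiparallel, the partner of a strand written 5’ to 3’ is its reverse complement. This sketch assumes the standard base pairing (A with T, G with C), which is not spelled out in these notes.

```python
# Standard Watson-Crick pairing: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand_5_to_3):
    """Return the opposite strand, also written 5' to 3'."""
    # Complement each base, then reverse because the partner strand
    # runs in the opposite orientation.
    return strand_5_to_3.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))
```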

The major groove and minor groove are the two grooves of different widths that run along the outside of the double helix.

In semiconservative replication, the double helix of each daughter DNA molecule contains one strand from the original DNA molecule and one newly synthesized strand.
In conservative replication, the parent DNA molecule is conserved and a single daughter double helix is produced consisting of two newly synthesized strands.
In dispersive replication, the daughter molecules consist of strands each containing segments of both parental DNA and newly synthesized DNA.

The replication fork is the zone where the double helix is unwound.
The leading strand is the new strand synthesized continuously in the direction in which the replication fork moves.

Okazaki fragments are short stretches of newly synthesized DNA on the lagging strand.

Primers are synthesized by a set of proteins called a primosome.

DNA ligase joins the 3’ end of one DNA fragment to the 5’ end of the downstream Okazaki fragment.
The lagging strand is the new strand formed in this way.

The replisome is a molecular machine that carries out DNA replication.

Pol III holoenzyme consists of two catalytic cores and many accessory proteins. 
The sliding clamp encircles the DNA and keeps pol III attached to the DNA molecule.

Helicases are enzymes that disrupt the hydrogen bonds that hold the two strands of the double helix together.

A nucleosome consists of DNA wrapped around histone proteins.

Daughter molecules are two identical versions of DNA.

Telomerase adds noncoding sequence to the ends of chromosomes.

Chapter 8
A pulse-chase experiment is performed by supplying infected bacteria with radioactive uracil (the pulse). Any RNA synthesized in the bacteria from then on is labeled with the readily detectable radioactive uracil. After incubation, the radioactive uracil is washed away and replaced by uracil that is not radioactive (the chase). The RNA recovered shortly after the pulse is labeled, but the RNA recovered somewhat longer after the chase is unlabeled, indicating that the RNA has a very short lifetime.

Transfer RNA (tRNA) molecules are responsible for bringing the correct amino acid to the mRNA in the process of translation.
Ribosomal RNAs (rRNAs) are the major components of ribosomes, which are large macromolecular machines that guide the assembly of the amino acid chain by the mRNA and tRNA.
Small nuclear RNAs (snRNAs) are part of a system that further processes RNA transcripts in eukaryotic cells.

RNA polymerase is attached to the DNA and moves along it linking the aligned ribonucleotides together to make a growing RNA molecule. 

Three steps in transcription are initiation, elongation and termination. 

Initiation is the binding of RNA polymerase to double stranded DNA. 
Elongation is the addition of nucleotides to the 3’ end of the chain.
Termination is the recognition of the transcription termination sequence. 
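
A simplified sketch of elongation only: the RNA chain grows at its 3’ end, one ribonucleotide at a time, each one paired to the DNA template. Initiation and termination are not modelled, and the pairing table (with U in RNA in place of T) is standard knowledge rather than something from these notes.

```python
# DNA template base -> ribonucleotide that pairs with it.
RNA_PARTNER = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build the RNA (written 5' to 3') from a template strand read 3' to 5'."""
    rna = []
    for dna_base in template_3_to_5.upper():
        # Elongation: each new ribonucleotide is added to the 3' end of the chain.
        rna.append(RNA_PARTNER[dna_base])
    return "".join(rna)

print(transcribe("TACGGT"))
```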

Chapter 9
Tertiary structure is produced by the folding of the secondary structure. 
Quaternary structure is composed of two or more separate folded polypeptides.

Globular proteins have compact structures.
Fibrous proteins are proteins with a linear shape.

Collinearity is the correspondence between the linear sequence of the gene and that of the polypeptide. 

If each of the 64 triplets has some meaning within the code, the genetic code must be degenerate: some of the amino acids have to be specified by at least two different triplets.
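
The degeneracy can be checked by counting. This sketch builds the standard genetic code (written with DNA letters, `*` for stop codons) and counts how many triplets specify each amino acid. The table comes from the standard code, not from these notes.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of different triplets per amino acid, ignoring stop codons.
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE), "triplets for", len(codons_per_amino_acid), "amino acids")
for amino_acid, count in sorted(codons_per_amino_acid.items()):
    print(amino_acid, count)
```

With 64 triplets and only 20 amino acids, most amino acids end up with more than one triplet.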

When a tRNA has an amino acid attached to it, it is said to be charged.

Most amino acids can be brought to the ribosome by several tRNA types, each having a different anticodon.
Certain charged tRNA species can bring their specific amino acids to any one of several codons. These tRNAs recognize and bind to several alternative codons through a loose kind of base pairing at the 3’ end of the codon and the 5’ end of the anticodon. This is called wobble.

Biological Machine is a multisubunit complex that performs cellular functions. 

The decoding center ensures that only tRNAs carrying anticodons that match the codon will be accepted into the A site.
Peptidyl transferase center is where peptide-bond formation is catalyzed. 

Chaperones help fold proteins. 

Chapter 10
Gene regulation is the regulation of the synthesis of a gene's transcript and of its protein product.

Cells need mechanisms that can recognize environmental conditions in which they should activate or repress transcription of the relevant genes. They also must be able to toggle on or off the transcription of each specific gene or group of genes. 

The allosteric site acts as a toggle switch that sets the DNA-binding domain in either functional or nonfunctional mode. The allosteric effector binds to the allosteric site of the regulatory protein such that it changes the structure of the DNA-binding domain. An allosteric transition is a change in shape.

A trans-acting gene can regulate all structural lac operon genes. 

In prokaryotes, all genes are transcribed into RNA by the same RNA polymerase. In eukaryotes, three different RNA polymerases function.
RNA transcripts are extensively processed during transcription in eukaryotes. The 5’ and the 3’ ends are modified and introns are spliced out.
RNA polymerase II is much larger than its prokaryotic counterpart. It must synthesize RNA and coordinate the special processing events unique to eukaryotes.

Activated transcription requires the binding of regulatory proteins to cis-acting elements in the DNA around the gene. 
Combinatorial interaction is where complex patterns of expression require many binding sites so that different regulatory proteins can interact with each other.

A dosage imbalance is corrected by dosage compensation, which makes the amount of most gene products from the two copies of the X chromosome in females equivalent to the single dose of the X chromosome in males.

Friday, July 8, 2016

Summer Research Week 2 & 3 (07/11 - 07/15 & 07/18 - 07/22)

Since you will be on a trip with limited internet access, I believe that the best assignment for you will be using the time to enhance and build your biology foundation (especially in Genetics and Molecular Biology). Please download the following book: An Introduction to Genetic Analysis in pdf format and carry that with you.
  • Browsing through  
    • Chapter 6- From Gene to Phenotype,  
    • Chapter 7- DNA: Structure and Replication,  
    • Chapter 8- RNA: Transcription and Processing,  
    • Chapter 9- Proteins and their Synthesis, and  
    • Chapter 10- Regulation of Gene Transcription. 
Get the main concepts and terminology without worrying about the details. Take notes on the definitions of key vocabulary, and document the important processes. Since you haven't taken AP Biology yet, this reading will give you what you need to read the more advanced research literature.

Thursday, July 7, 2016

Notes - Week 1

Notes
  • Bioinformatics is the field of science in which biology, computer science, and information technology meet.
  • Bioinformatics deals with the application of computers to the collection and analysis of biological data to solve biological problems.
  • Bioinformatics has become popular due to the demand for means of storing and managing complex biological data sets.
  • The goal of bioinformatics is to find new discoveries in the biological field.
  • Common uses of bioinformatics are the analysis of amino acid sequences, protein domains, protein structures, and nucleotide sequences.
  • Bioinformatics started with Gregor Mendel and a need to record the vast number of genetic traits that he collected in his experiments.
  • Electrophoresis is a technique that separates macromolecules based on size.
  • A famous project is the Human Genome Project that fueled bioinformatics by creating a need for databases.
  • Bioinformatics involves the development of computational algorithms that help us understand biological processes. The goal is to improve sectors such as agriculture, where it can be used, for example, to increase the nutritional content of crops.
  • Bioinformaticians use computational tools and systems to answer problems of biology. Computational biologists develop theories, algorithms, and techniques for the tools bioinformaticians use.
  • Biological databases are databases of sequence data from various genome sequencing projects. They help explain biological phenomena and assist in fighting diseases and creating medications.
  • NCBI is the National Center for Biotechnology Information. It is the premier website for biomedical research and can be divided into many sub-databases.
  • GenBank is a collection of all public nucleotide sequences and their protein translations.
  • Bioinformatics uses programming languages such as Java.
  • Common tasks include string manipulation and file parsing.
  • Biodiversity databases collect a wide variety of information on a wide range of species. Computer simulations use this information to model, for example, population dynamics.
  • The most well known application of Bioinformatics is sequence analysis.
  • Shotgun sequencing is easy and quick, and known sequences of a genome can be compared to find specific mutations.
  • BLAST searches the genomes of organisms and uses dynamic programming.
  • FASTA is a text format used to represent either nucleotide sequences or peptide sequences.
  • Homology modeling is predicting the structure of a protein once the structure of a homologous protein is known.
  • Sequencing of plant and animal genomes can improve the quality of crops and livestock.
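
String manipulation and file parsing come together when reading FASTA. In that format each record starts with a `>` header line followed by one or more lines of sequence. The sketch below is a minimal reader written in Python for illustration, not a full implementation; the function name `parse_fasta` and the two example records are made up.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Made-up records: one nucleotide sequence split over two lines, one peptide.
example = io.StringIO(
    ">seq1 toy nucleotide record\nATGC\nGGTA\n>seq2 toy peptide record\nMKV\n"
)
records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

The same function works on a real file opened with `open(path)`, since it only needs something that can be read line by line.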