Introduction

FIGG is a genome simulation tool. It uses known or theorized variation frequencies, calculated per fragment of a given size and grouped by GC content across a genome, to model new genomes in FASTA format while tracking the applied mutations for use in analysis tools or population simulations.

FIGG uses Apache Hadoop MapReduce and HBase to rapidly generate individual genomes, allowing users to scale generation up to fit specific project needs.


Instructions for Local Installation

Hadoop 1.2 and HBase 0.94 are required to run FIGG on a local cluster or in a single-node setup. Please see the documentation provided by Apache to set up a single-node installation of Hadoop.

Setup

  1. Download the HBase-Genomes-1.1.jar file from the releases page.
  2. Download normal-freq-hbase.tgz. This contains the "normal" variation frequency database based on 1000Genomes and HapMap.
    Uncompress it, copy the directories (not single files!) into HDFS (or S3), and import them into HBase:
    hadoop dfs -copyFromLocal /path/to/normal-freq /my/hdfs/path 
    hadoop jar HBase-Genomes-1.1.jar hbaseutil -d /my/hdfs/path/to/normal-freq -c IMPORT
  3. Download FASTA files for GRCh37 (or another human reference release). UCSC provides these as separate chromosomes; ignore the chr*_random and chrUn_* files. Load them into HDFS (or S3):
    hadoop dfs -copyFromLocal /path/to/FASTAfiles /my/hdfs/path 
    Then run the 'fastaload' job:
    hadoop jar HBase-Genomes-1.1.jar fastaload -g [genome name] -f [hdfs/s3 path to FASTA file directory]
    If you are running on AWS, make sure you also run the hbaseutil EXPORT job immediately after this.
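    A minimal export sketch, using the hbaseutil usage shown under Available Tools (the S3 bucket path is a placeholder, not a real location):
    hadoop jar HBase-Genomes-1.1.jar hbaseutil -d s3://my-bucket/figg-tables -c EXPORT
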
Run

  1. Each mutate job generates one new genome; a new genome name is required each time. The job generates mutated fragments and stores them in HBase, and it can be run as many times as required (see the loop sketch after this list).
            hadoop jar HBase-Genomes-1.1.jar mutate -p GRCh37 -m MyNewGenomeName
            
  2. This step generates the FASTA files for each chromosome of any genome identified by name. It can be run at any time after the mutation step, as many times as needed. If you generate the same genome twice, any existing files for that genome in HDFS will be overwritten.
            hadoop jar HBase-Genomes-1.1.jar gennormal -g MyNewGenomeName -o /my/hdfs/path
            
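To simulate a small population, the two jobs above can simply be repeated with a different genome name each time. A minimal shell sketch; the genome names and output paths are placeholders:

    for i in 1 2 3; do
      hadoop jar HBase-Genomes-1.1.jar mutate -p GRCh37 -m SimGenome$i
      hadoop jar HBase-Genomes-1.1.jar gennormal -g SimGenome$i -o /my/hdfs/path/SimGenome$i
    done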

The resulting FASTA files can be copied out of HDFS using:

hadoop dfs -copyToLocal ... 
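
For example (both paths are placeholders):

    hadoop dfs -copyToLocal /my/hdfs/path/MyNewGenomeName /local/output/path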

Optional: Generate your own variation frequencies

Please see the R directory under 'fragment-database/normal' for scripts that generate files in the appropriate format. This is not set up as a Hadoop job, so the files can be loaded from the local file system (not HDFS) into HBase by running:

    java -cp HBase-Genomes-1.1.jar org.lcsb.lu.igcsa.hbase.ImportVariationData \
    -v /local/path/variation.txt \
    -g /local/path/gc_bin.txt \
    -s /local/path/snv_prob.txt \
    -z /local/path/size_prob.txt \
    -f /local/path/variation_per_bin.txt
      

FAQ

Does FIGG work with all genomes?
Currently FIGG has only been tested on human GRCh37; however, the tool itself will work with any genomic data in FASTA format. The database tables will need to be generated from your specific genome and its known/hypothesized variation information, and the scripts available in the source download will require alteration to work with your files (or new scripts will need to be written).
What data is provided?
The data provided in the database tables are based on analysis of 1000Genomes and HapMap variations called against the human reference GRCh37 (hg19). These variation tables can be used with any human genome sequence; however, variation frequencies may change between releases.
Can I include specific variations at known locations?
FIGG currently only places variations at random locations based on the frequency data. Future implementations will include the ability to specify variations at known locations.

Available Tools

Usage: hadoop jar HBase-Genomes-1.1.jar [program name] [args]
  • fastaload
    -g [genome name] -f [hdfs path to FASTA file directory]
  • hbaseutil
    -d [hdfs directory for read/write] -c [IMPORT|EXPORT] -t [comma-separated list of tables, optional]
  • mutate
    -p [reference genome, ex. GRCh37] -m [new genome name]
  • gennormal
    -g [genome name, ex. GRCh37 or a mutated genome name] -o [hdfs output path for FASTA files]
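
The hbaseutil -t flag can limit an import or export to specific tables; for example (the directory is a placeholder):

    hadoop jar HBase-Genomes-1.1.jar hbaseutil -d /my/hdfs/backup -c EXPORT -t genome,small_mutations
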
Run on Amazon Elastic MapReduce

Several shell scripts are provided with the source files; they use the Amazon Command Line Interface (CLI) and the Ruby elastic-mapreduce (EMR) client to submit and run the jobs. These scripts assume that you have uploaded all of the provided tables to a folder in S3. The specified job will load all of the tables into HBase, run the requested jobs, and either export the database tables (genome, sequence, chromosome, small_mutations) or generate new FASTA files (in S3).
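
As a rough illustration only (the scripts shipped with the source are the authoritative reference), a mutate job might be submitted with the Ruby elastic-mapreduce client along these lines; the bucket, cluster size, and instance type are placeholders:

    elastic-mapreduce --create --name "figg-mutate" --hbase \
      --num-instances 5 --instance-type m1.large \
      --jar s3://my-bucket/HBase-Genomes-1.1.jar \
      --arg mutate --arg -p --arg GRCh37 --arg -m --arg MyNewGenomeName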

These scripts set up Spot Instances for less expensive runs; however, please keep in mind that running these jobs is charged to the owner of the AWS account and is not paid for by us.

Please see Amazon Web Services documentation for information on setting up your AWS account, security, and S3.

Software License

Apache License 2.0
      Copyright 2013 University of Luxembourg and the Luxembourg Centre for Systems Biomedicine

      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.