Introduction
FIGG is a genome simulation tool. It uses known or theorized variation frequencies, measured per fragment of a given size and grouped by GC content across a genome, to model new genomes in FASTA format. It tracks the mutations it applies so they can be used in analysis tools or population simulations.
FIGG uses Apache Hadoop MapReduce and HBase to rapidly generate individual genomes and lets users scale up generation to fit specific project needs.
Instructions for Local Installation
Hadoop version 1.2 and HBase 0.94 are required to run FIGG on a local cluster or single-node setup. Please see the documentation provided by Apache to set up a single-node installation of Hadoop.
Setup
- Download the HBase-Genomes-1.1.jar file from the releases page.
- Download normal-freq-hbase.tgz. This contains the "normal" variation frequency database based on 1000Genomes and HapMap.
Uncompress the archive, copy the directories (not single files!) into HDFS (or S3), and import them into HBase:
hadoop dfs -copyFromLocal /path/to/normal-freq /my/hdfs/path
hadoop jar HBase-Genomes-1.1.jar hbaseutil -d /my/hdfs/path/to/normal-freq -c IMPORT
- Download FASTA files for GRCh37 (or another human reference release). UCSC provides these as separate chromosomes; ignore the chr*_random and chrUn_* files. Load these into HDFS (or S3):
hadoop dfs -copyFromLocal /path/to/FASTAfiles /my/hdfs/path
Then run the 'fastaload' job:
hadoop jar HBase-Genomes-1.1.jar fastaload -g [genome name] -f [hdfs/s3 path to FASTA file directory]
If you are running on AWS, make sure you also run the hbaseutil EXPORT job immediately following this.
- Each mutate job generates one new genome. A new genome name is required each time. This job generates mutated fragments and stores them in HBase. This can be run as many times as required.
hadoop jar HBase-Genomes-1.1.jar mutate -p GRCh37 -m MyNewGenomeName
- This step generates all of the FASTA files for each chromosome in any genome identified by name. This can be run at any time, as many times as needed, after the mutation step. If you try to generate the same genome twice, any existing files for that genome in HDFS will be overwritten.
hadoop jar HBase-Genomes-1.1.jar gennormal -g MyNewGenomeName -o /my/hdfs/path
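Because each mutate job needs a fresh genome name, generating several genomes is easy to script. The following is a minimal sketch, not part of FIGG itself: the `DRY_RUN` switch, the `run` and `generate_genomes` helpers, and the `Sim` name prefix are assumptions made for illustration, and `/my/hdfs/path` is the same placeholder used above. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: generate several mutated genomes from one loaded reference.
# Assumptions (not from FIGG): DRY_RUN, run(), generate_genomes(),
# the "Sim" prefix, and the /my/hdfs/path placeholder.
set -euo pipefail

JAR="HBase-Genomes-1.1.jar"
REFERENCE="GRCh37"
OUT_DIR="/my/hdfs/path"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build a distinct genome name per iteration from a prefix and a counter,
# then run the mutate step followed by the FASTA generation step.
generate_genomes() {
  local prefix="$1" count="$2" i name
  i=1
  while [ "$i" -le "$count" ]; do
    name="${prefix}${i}"
    run hadoop jar "$JAR" mutate -p "$REFERENCE" -m "$name"
    run hadoop jar "$JAR" gennormal -g "$name" -o "$OUT_DIR"
    i=$((i + 1))
  done
}

generate_genomes "Sim" 3
```

Setting `DRY_RUN=0` executes the commands instead of printing them. The sketch assumes the reference genome and the variation frequency tables have already been loaded as described in the steps above.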
Run
The resulting FASTA files can be copied out of HDFS using
hadoop dfs -copyToLocal ...
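Once the files are on the local file system, a quick sanity check can confirm that each one is a non-trivial FASTA file. The helper below is a hypothetical sketch (the name `fasta_summary` is not part of FIGG). It relies only on the FASTA convention that header lines start with `>` and all other lines are sequence.

```shell
# Summarise one FASTA file: number of records and total sequence length.
# fasta_summary is a hypothetical helper, not a FIGG tool.
fasta_summary() {
  awk '
    /^>/ { records++; next }                        # header line of a record
    { gsub(/[ \t\r]/, ""); bases += length($0) }    # sequence line, whitespace removed
    END { printf "%d records, %d bases\n", records, bases }
  ' "$1"
}
```

Calling `fasta_summary` on a copied chromosome file prints one summary line, which makes empty or truncated files easy to spot.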
Optional: Generate your own variation frequencies
Please see the R directory under 'fragment-database/normal' to generate files of the appropriate format. This is not set up as a Hadoop job, so these files can be loaded from the local file system (not HDFS) into HBase by running:
java -cp HBase-Genomes-1.1.jar org.lcsb.lu.igcsa.hbase.ImportVariationData \
  -v /local/path/variation.txt \
  -g /local/path/gc_bin.txt \
  -s /local/path/snv_prob.txt \
  -z /local/path/size_prob.txt \
  -f /local/path/variation_per_bin.txt
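Since the import reads five separate local files, it can help to confirm that all of them exist and are non-empty before starting. This is a small sketch under that assumption. `check_inputs` is a hypothetical helper, and it checks only for presence, not for the file format expected by the importer.

```shell
# Return non-zero and report each input file that is missing or empty.
# check_inputs is a hypothetical helper, not a FIGG tool.
check_inputs() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_inputs /local/path/variation.txt /local/path/gc_bin.txt` returns success only when both files are present and non-empty.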
FAQ
- Does FIGG work with all genomes?
- Currently FIGG has only been tested on human GRCh37. However, the tool itself will work with any genomic data in FASTA format. The database tables will need to be generated from your specific genome and its known or hypothesized variation information. The scripts available in the source download will need to be altered to work with your files, or new ones written.
- What data is provided?
- The data provided in the database tables are based on analysis of 1000Genomes and HapMap variations called against human reference GRCh37 (hg19). These variation tables can be used with any human genome sequence; however, variation frequency may change between releases.
- Can I include specific variations at known locations?
- FIGG currently places variations only at random locations based on frequency data. Future implementations will include the ability to specify location-based variations.
Available Tools
- Usage: hadoop jar HBase-Genomes-1.1.jar [program name] [args]
- fastaload
  -g [genome name] -f [hdfs path to FASTA file directory]
- hbaseutil
  -d [hdfs directory for read/write] -c [IMPORT|EXPORT] -t [comma separated list of tables OPTIONAL]
- mutate
  -p [reference genome, ex. GRCh37] -m [new genome name]
- gennormal
  -g [reference genome name, ex. GRCh37] -o [hdfs output path for FASTA files]
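All four programs share the same `hadoop jar` invocation, so a thin wrapper can catch a mistyped program name before a job is submitted. The function below is a sketch only. The name `figg` and the `FIGG_RUNNER` and `FIGG_JAR` variables are assumptions made for illustration and are not provided by FIGG.

```shell
# Wrapper that accepts only the four program names listed above.
# figg, FIGG_RUNNER and FIGG_JAR are hypothetical names, not part of FIGG.
figg() {
  case "${1:-}" in
    fastaload|hbaseutil|mutate|gennormal) ;;
    *) echo "unknown program: ${1:-<none>}" >&2; return 2 ;;
  esac
  # FIGG_RUNNER defaults to hadoop; FIGG_RUNNER=echo prints the arguments instead.
  ${FIGG_RUNNER:-hadoop} jar "${FIGG_JAR:-HBase-Genomes-1.1.jar}" "$@"
}
```

With this in place, `figg mutate -p GRCh37 -m MyNewGenomeName` runs the same command as the mutate step shown earlier, while a call such as `figg mutat` fails immediately with an error message.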
Several shell scripts are provided with the source files which use the Amazon Command Line tool (CLI) and Ruby elastic-mapreduce (EMR) to submit and run the jobs. For this it is assumed that you will upload all of the tables provided to a folder in S3. The job specified will load all of the tables into HBase, run the specified jobs, and export the database tables (genome,sequence,chromosome,small_mutations) or generate new FASTA files (in S3).
These scripts set up Spot Instances for less expensive runs. However, please keep in mind that running these jobs is charged to the owner of the AWS account; the cost is not covered by us.
Please see Amazon Web Services documentation for information on setting up your AWS account, security, and S3.
Software License
Copyright 2013 University of Luxembourg and the Luxembourg Centre for Systems Biomedicine

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.