/***************************************************************************
 *   Copyright (C) 2008 by Wolfgang Gerlach                                *
 *                                                                         *
 *                                                                         *
 *   BALSAM (BAsic fiLter for Semigobal non-gapped AlignMent search),      *
 *   part of the SWIFT software suit,                                      *
 *   is free software; you can redistribute it and/or modify  it under     *
 *   the terms of the GNU Lesser General Public License as published by    *
 *   the Free Software Foundation; either version 2 of the License, or     *
 *   (at your option) any later version.                                   *
 *                                                                         *
 *   This program is distributed in the hope that it will be useful,       *
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of        *
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         *
 *   GNU Lesser General Public License for more details.                   *
 *                                                                         *
 *   You should have received a copy of the GNU Lesser General Public      *
 *   License along with this program; if not, write to the                 *
 *   Free Software Foundation, Inc.,                                       *
 *   59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.             *
 *                                                                         *
 *   E-mail: wgerlachXcebitec.uni-bielefeld.de , where X=@                 *
 ***************************************************************************/



SWIFT BALSAM (BAsic fiLter for Semigobal non-gapped AlignMent search)
A tool to find those reads in a set of reads (or any other DNA-sequences)
which have a perfect k-mer match against a given bigger sequence (e.g. a genome).

Where BLAST needs a long time (hours or days) to align millions of reads against a genome,
BALSAM can do this in a few minutes. It therefore can be used as a filter in a
preprocessing step to find all reads with a k-mer match (and a given similarity threshold).
The much smaller set of matching reads can then be "blasted" against the genome in significantly less time.


If you use SWIFT BALSAM, please cite the author and the url http://bibiserv.techfak.uni-bielefeld.de/swift/ 
Author: Wolfgang Gerlach ( http://www.cebitec.uni-bielefeld.de/~wgerlach/ )



1) ****** Installation *****

> g++ -O3 -o balsam balsam.cpp


Alternatively for 64bit, if it is not already default:
> g++ -m64 -O3 -o balsam balsam.cpp

Currently BALSAM uses the BOOST library, if necessary give path, e.g.:
> g++ -I /vol/gi-sw/include/ -m64 -O3 -o balsam balsam.cpp
(recommended for the cebitec)



2) ***** Usage *****

Usage: balsam -d database -i query -o output [OPTIONS]

  obligatory:
  -d database        input file in FASTA format
                       (currently only first sequence in file will
                        be used)
  -i query           input file in FASTA format
  -o output          output file for matching queries in FASTA format

  optional:
  -q int             seed length (default 11)
  -l int             maximal length of a query (default 200)
  -s double          minimal similarity in the range 0-1 (default 0.6)
                       with "-s 0" every query with seed-match will
                       be reported (faster because there is no further
                       verification)
  -n                 no reverse complement (default yes)
  -f format          output format, currently supported:
                       fasta: no modifications (default)
                       balsam: similar to fasta
                       (swift: under construction)
  -r 1|2             1: report only first seed-match of a query
                       (faster, implies -s 0.0 !) (default no)
                     2: report only first match of a query 
                       (respects "-s", not necessarily the best match!)
                       (default no)
  -c int             special modus to count q-grams in database
                       list only q-grams with occurrence >= int
                       only options "-q" and "-d" are needed here
  -v                 contact and license information

  examples:
 "balsam -d genome.fas -i reads.fas -o results.fas -q 12 -s 0.9 -l 300"
 "balsam -d genome.fas -c 10 -q 12 | sort"


Please be aware that that SWIFT BALSAM was written to map short reads and thus can handle only single-line FASTA
sequences! If you have such kind of data, either modify your data accordingly, or feel free to modify
the source code... ;-)



3) ***** Time and space usage *****

Space:

1) subject sequence: n bytes
2) q-gram hash:      32bit-system: 4^q * 4 bytes (e.g. q=13: 256 MB; q=14: 1024 MB;
						q>14 can not be managed on 32bit)
                     64bit-system: 4^q * 8 bytes (e.g. q=13: 512 MB; q=14: 2048 MB; q=16: 16 GB ...)
3) q-gram array:     32bit-system: 4n bytes  (very similar to a suffix array!)
                     64bit-system: 8n bytes
4) array of results: O(#matches)

The parameter q mainly affects the space usage. In general, the bigger q the faster the program. But
 be careful to choose a q such that memory usage does not exceed the main memory space.


Time:

n=subject length; m=query length
1) index construction = O(4^q)+O(n)
2) search per query: O(m)+O((#matches)^2)




Feb 16th, 2010
Wolfgang Gerlach


