



                                    PHASE4


                 AUTOMATIC EVALUATION OF DATABASE SEARCH METHODS


                              Version 1.6, 2002



Marc Rehmsmeier

Email: marc@techfak.uni-bielefeld.de


Please cite:

"Phase4 --- automatic evaluation of database search methods."


Table of Contents
-----------------

Introduction

Installation

The Config File

The Graphical User Interface
  Constructing an Evaluation Environment
  Running Methods
  Evaluating Methods
  Producing Figures and Tables

Customization
  Adding Constructors
  Adding Methods
  Adding Evaluators
  Adding Reporters

Command-line parameters


Introduction
------------

Please read the accompanying paper.


Installation
------------

Phase4 is implemented in Python and should work with Python version 2.0 or
higher. For the graphical user interface, Pmw (Python megawidgets) and Tkinter
have to be installed. Apart from unpacking the tar file, no further
installation steps should be necessary. If your Python interpreter is not
located at /vol/python/bin/python, adjust the interpreter path in the affected
Phase4 programs.
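
For example, assuming the Phase4 programs start with an interpreter ("shebang")
line, the adjustment amounts to editing the first line of each affected script
(the paths below are examples only):

    #!/vol/python/bin/python        <- as shipped
    #!/usr/local/bin/python         <- changed to the local Python path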


The Config File
---------------

The file config.py contains most of the configuration information for Phase4,
such as possible parameters, database locations, and system commands. You
should at least adapt the variables ADDONS and PHASE4_PATH and the paths to the
programs used by the methods, such as the HMMER package or BLAST. Note that the
program packages of the methods to be compared have to be downloaded and
installed separately.
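
As an orientation, the kind of assignments to look for in config.py is sketched
below; the values are placeholders, and the program-path variables are shown
with hypothetical names standing for whatever variables config.py actually uses
for the respective packages:

    # config.py (excerpt, illustrative values only)
    PHASE4_PATH = "/homes/juser/phase4"        # where Phase4 was unpacked
    ADDONS      = PHASE4_PATH + "/addons"      # dictionaries etc.
    # paths to separately installed packages (variable names hypothetical):
    HMMSEARCH   = "/usr/local/hmmer/bin/hmmsearch"
    BLASTALL    = "/usr/local/blast/bin/blastall"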


The Graphical User Interface
----------------------------

The Graphical User Interface (GUI), which is implemented in phase4.py, is
divided into different set-up sheets which can be accessed by clicking on the
buttons in the top row. The sheets are called Data, Methods, Execution,
Evaluation, and Reporting. After the user has chosen parameters on these
sheets, the four phases of the evaluation are executed by clicking on the
corresponding buttons in the bottom row. The phases are Construct, Run,
Evaluate, and Report. There is also a Quit button which quits the GUI. The
different sheets and phases are explained in the following paragraphs.


Constructing an Evaluation Environment
--------------------------------------

The first phase is to construct an evaluation environment, i.e. to parse a
source database (e.g. SCOP pdb90), to extract training and test data, and to
build multiple alignments, HMMs, or whatever the methods to be tested need as
queries. For the construction phase, the user has to choose parameters on the
set-up sheets Data and Methods. The construction phase is then started by
pressing the Construct button. The parameters are as follows:

Data sheet
----------

workpath

  The path where the evaluation environment is constructed. All files such as
  training and test data, method outputs, performance measures, and report
  files are saved in automatically chosen directories and files, starting at
  the given workpath. Make sure that there is enough disk space.

scop

  In the current version of Phase4, only SCOP-like databases (e.g. SCOP
  versions 1.37 or 1.53) can be used as source databases. You can use your own
  database as long as it has the SCOP annotation structure.

targetdb

  Not needed for the construction phase. See the section "Running Methods".

Constructors

  The constructors determine the evaluation settings. At least one has to be
  chosen, but several can also be used simultaneously. The sizes, i.e. the
  numbers of sequences, of the extracted families and the remaining superfamily
  have to lie within certain limits, as defined in constraints.py.

  distant family one model

    From a superfamily, each family in turn is chosen to be a test
    family. From the remainder of the superfamily, one model (e.g. an HMM) is
    constructed that is later used to search the database.

  distant family one model nonredundant

    This setting is like the preceding one, but only the first family is chosen
    to be a test family. Thus, each superfamily produces only one test set.

  distant family one per family

    From a superfamily, each family in turn is chosen to be a test family. For
    each of the remaining families, one model (e.g. an HMM) is constructed that
    is later used to search the database.

  distant family one per family nonredundant

    This setting is like the preceding one, but only the first family is chosen
    to be a test family. Thus, each superfamily produces only one test set.

  distant family single sequences

    This setting is like "distant family one model", but each sequence of the
    remainder of the superfamily is used separately as a query for the methods
    to be tested. Thus, the setting results in a family pairwise search.

  distant family single sequences nonredundant

    This setting is like the preceding one, but only the first family is chosen
    to be a test family. Thus, each superfamily produces only one test set.

  distant family one sequence

    From a superfamily, each family in turn is chosen to be a test family. Each
    of the remaining sequences is later used as a query.

  family halves one model

    From each family of a superfamily, half its sequences are chosen as
    training, the remaining sequences as test sequences. From the training
    sequences (drawn from all families of the superfamily), one model (e.g. an
    HMM) is constructed that is later used to search the database.

  family halves one per family

    From each family of a superfamily, half its sequences are chosen as test
    sequences. For each of the remaining family halves, one model (e.g. an HMM)
    is constructed that is later used to search the database.

  family halves single sequences

    This setting is like "family halves one model", but each training sequence
    is used separately as a query for the methods to be tested. Thus, the
    setting results in a family pairwise search.

  family half one model

    From a family, half its sequences are chosen as test sequences, the
    remaining family sequences as training sequences. The sequences of
    the surrounding superfamily will be ignored in the evaluation.

  superfamily refind one model

    The whole superfamily serves as both the training set and the test set.

  family refind one model

    The whole family serves as both the training set and the test set. The
    sequences of the surrounding superfamily will be ignored in the evaluation.


Methods sheet
-------------

The training data are constructed according to the methods to be tested. The
process is organised very much like the execution of a makefile: for each
method chosen on the Methods sheet, the necessary file objects, such as
multiple alignments or HMMs, are constructed. Which objects are necessary is
defined in the method classes in methods.py and in the object classes in
objects.py.
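
A conceptual sketch of this make-like behaviour is given below; the object
interface is hypothetical, the real classes live in methods.py and objects.py.
The idea is that each file object knows its prerequisites, and missing objects
are built depth-first before the object itself is made:

    # Conceptual sketch only -- the object interface is hypothetical.
    def build(obj):
        if not obj.exists():
            for prereq in obj.prerequisites():
                build(prereq)               # build missing prerequisites first
            obj.make()                      # then build the object itself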


Running Methods
---------------

The second phase is to run the methods to be tested in the constructed
evaluation environment. For the execution phase, the user has to choose from
the Data and the Methods sheets as in the construction phase and from the
Execution sheet. The execution phase is then started by pressing the Run
button. (Additional) parameters are as follows:

Data sheet
----------

targetdb

  The database that is searched by the methods to be tested can be different
  from the source database from which the training and test data were extracted
  (see parameter scop). This can be useful, e.g., to assess how sensitive a
  method is to the size of the database, by adding randomly generated sequences
  to the source database. If the parameter targetdb is not given, the source
  database, given by the parameter scop, is searched.

Execution sheet
---------------

Caller

  The callers determine whether the methods are executed on the local machine
  (the one the phase4 GUI is running on) or distributed in a cluster via the
  Load Sharing Facility (LSF) or the Sun Grid Engine (SGE). LSF and SGE have to
  be obtained and installed separately if you want to use them.

  local

    Execute the methods on the local machine.

  lsf

    Distribute the methods via the Load Sharing Facility. Adapt the LSF_COMMAND
    in config.py to your requirements, especially concerning the queue name.

  sge

    Distribute the methods via the Sun Grid Engine. Adapt the SGE_COMMAND in
    config.py to your requirements, especially concerning the queue name.


caller output

  If using LSF or SGE, one might be interested in e.g. the assigned job numbers
  or the number of jobs that have finished already. This can be determined from
  the output of the queueing commands (bsub for LSF and qsub for SGE). The
  caller output can be saved to any file. The default value is /dev/null, thus
  discarding the output.



Evaluating Methods
------------------

The third phase is to evaluate the methods that were run in the previous phase,
the execution phase. For the evaluation phase, the user has to choose from the
Data and the Methods sheets as in the construction and execution phases and
from the Evaluation sheet. The evaluation phase is then started by pressing the
Evaluate button. (Additional) parameters are as follows:

Evaluation sheet
----------------

Evaluators

  From the sorted and possibly combined outputs (see below), various measures
  can be calculated which express the methods' search sensitivity and
  selectivity. For each test case, the chosen evaluation measures are calculated
  and saved. This results in one file per chosen evaluation measure, with as
  many lines per file as there are test cases. In the following, the positives
  are the test sequences and the negatives are the non-homologous target
  sequences; the training sequences are ignored. A small sketch of how some of
  these measures can be computed from a ranked hit list is given after the list
  of evaluators below.

  eqnr

  Equivalence numbers. The equivalence number is the number of false positives
  at a score (or E-value) threshold at which this number equals the number of
  false negatives.

  releqnr

  Relative equivalence numbers. The relative equivalence number is the
  equivalence number divided by the size of the test case (number of test
  sequences).

  minfpcount

  Minimum false positive counts. The minimum false positive count is the
  number of negatives that rank higher than the first positive.

  medfpcount

  Median false positive counts. The median false positive count is the
  median of the false positive counts (number of higher ranking negatives) of
  all positives.

  maxfpcount

  Maximum false positive counts. The maximum false positive count is the number
  of negatives that rank higher than the last positive.

  sumfpcount

  Sum false positive counts. The sum false positive count is the sum of the
  false positive counts (number of higher ranking negatives) of all positives.

  minfprate

  Minimum false positive rates. The minimum false positive rate is the minimum
  false positive count divided by the number of targets (positives and
  negatives). 

  medfprate

  Median false positive rates. The median false positive rate is the median
  false positive count divided by the number of targets (positives and
  negatives). 

  maxfprate

  Maximum false positive rates. The maximum false positive rate is the maximum
  false positive count divided by the number of targets (positives and
  negatives). 

  covfpcount

  Coverage false positive counts. For a range of percent coverage, which is the
  number of true positives divided by the total number of positives times 100,
  the number of false positives is calculated. Thus, for each test set, a list
  of false positive counts is given, the elements corresponding to the
  coverages (by default from 1 to 100 percent in steps of 1).

  tpcount0

  Strict true positive count. The tpcount0 is the number of positives that
  score better than the best-scoring negative.

  tpcount50

  True positive count up to 50. The tpcount50 is the number of positives that
  score as well as or better than the 50 best-scoring negatives.

  roc50

  Receiver operating characteristic. The ROC50 measure is the sum of all true
  positive counts up to 50. In other words, for all numbers of false positives
  between 1 and 50, the corresponding number of true positives is calculated,
  and all these numbers are summed up.

  copylist

  This returns, for each test case, a list of the first true negatives; the
  length of this list is determined by COPYLISTLENGTH in the config file
  (default 100). The list consists either of scores or of E-values, depending
  on the chosen sort style (see Sort styles below). Copylist can be useful,
  e.g., for comparing predicted numbers of false positives (E-values) with
  actual numbers.
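
  As an illustration (a sketch only; the real implementations are in
  evaluators.py), two of the measures can be computed from a ranked hit list as
  follows, ignoring ties between scores. Here 'ranked' is a list of booleans,
  sorted from best to worst hit, with True marking a positive (test sequence)
  and False a negative:

      def minfpcount(ranked):
          # negatives that rank higher than the first positive
          fp = 0
          for is_pos in ranked:
              if is_pos:
                  return fp
              fp += 1
          return fp                        # no positive found at all

      def eqnr(ranked):
          # false positives at the threshold where their number first
          # reaches the number of false negatives
          total_pos = ranked.count(True)
          fp = tp = 0
          for is_pos in ranked:
              if is_pos:
                  tp += 1
              else:
                  fp += 1
              if fp >= total_pos - tp:     # total_pos - tp = false negatives
                  return fp
          return fp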

Sort styles

  The output hits can be sorted according to their score or their E-value
  (provided that both are present in the output). This can affect the
  calculated measures such as minfpcount (see Evaluators above).

  by score

    Output hits are sorted according to their score.

  by evalue

    Output hits are sorted according to their E-values.

Combine styles

  If for a test set more than one training set was used (e.g. in the
  "one per family" or the "single sequences" settings), each test sequence
  is associated with several scores or E-values, originating from the
  several models. To obtain exactly one score or E-value for each test
  sequence, the results can be combined by choosing the best or by calculating
  the average.

  best

    Combine results by choosing the best.

  avrg

    Combine results by calculating the average.
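
  A minimal sketch of the two combine styles, assuming that the scores of one
  test sequence against the several models have been collected in a list (here
  higher scores are better; for E-values, "best" would be the minimum):

      def combine_best(scores):
          return max(scores)               # keep the best of the several models

      def combine_avrg(scores):
          return float(sum(scores)) / len(scores)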



Producing Figures and Tables
----------------------------

Once the methods have been evaluated, the results can be presented in figures
and tables, which is the fourth phase. The user has to choose from the Data,
the Methods, the Evaluation, and the Reporting sheet. The report phase is then
started by pressing the Report button. (Additional) parameters are as follows:

Reporting sheet
---------------

Reporter

  From the calculated evaluation measures such as minimum false positive counts
  (minfpcount), various graphical and tabular summaries can be produced:

  ascii table

  Prints all chosen measures for all chosen methods in an ASCII table. The
  table is divided into sections, each section representing a superfamily. For
  each such section, the test families are given. If a dictionary is given (see
  below), superfamilies and test families are denoted by their names, otherwise
  by their SCOP codes.

  latex table

  Prints all chosen measures for all chosen methods in a LaTeX table. The
  table is divided into sections, each section representing a superfamily. For
  each such section, the test families are given. If a dictionary is given (see
  below), superfamilies and test families are denoted by their names, otherwise
  by their SCOP codes.

  performance figure

  Produces a diagram in which, for cut-offs on the x-axis and for the chosen
  measures and methods, the y-axis gives the percentage of test cases in which
  a method achieved that measure at or below the cut-off (see the sketch after
  this list of reporters).

  unique latex table

  Prints, for a chosen measure and each of the chosen methods, the test cases
  in which that method had the measure at or below a certain cut-off while all
  other chosen methods had it above the cut-off. The result is a LaTeX table.

  unique figure

  Produces a diagram in which, for cut-offs on the x-axis and for a chosen
  measure and each of the chosen methods, the y-axis gives the number of test
  cases in which that method had the measure at or below the cut-off while all
  other chosen methods had it above the cut-off.

  cov fp plot

  Coverage false positive plot. For each chosen method and for a range of false
  positives on the x-axis, the average coverage (averaged over all test sets)
  is plotted on the y-axis. On the evaluation sheet, covfpcount has to be
  selected.

  cov logfp plot

  Coverage log false positive plot. The same as cov fp plot, but with the
  logarithms of the false positive counts. On the evaluation sheet, covfpcount
  has to be selected.

  correlation plot

  For two chosen methods, a diagram is produced in which, for each test case,
  the measure values of the two methods are plotted as a dot with the
  corresponding coordinates.

  copylist plot

  On the evaluation sheet, copylist has to be selected. For each index of the
  copy lists (by default between 1 and 100, see COPYLISTLENGTH in config.py),
  the average over all test sets of the list entries is plotted on the y-axis.
  If the sort style is "by evalue", the copylist plot compares actual numbers
  of false positives on the x-axis with predicted ones (E-values) on the
  y-axis.

  avrg one dim plot

  The chosen methods are enumerated on the x-axis, and for each method, the
  average of all test cases under a chosen measure is plotted on the y-axis.
  This is particularly useful for optimizing a method with respect to a certain
  parameter. See Method Generators below.

  avrg two dim plot

  The chosen methods are enumerated on the x-axis and on the y-axis, and for
  each method, the average of all test cases under a chosen measure is plotted
  on the z-axis. This is particularly useful for optimizing a method with
  respect to two parameters. See Method Generators below. WARNING: the number
  of points in the first dimension is currently hard coded in the file
  report.py!

  signed rank test

  For all pairs of chosen methods, a Wilcoxon signed rank test is performed,
  which, for a given significance level, tells if the performance of one method
  is significantly different from the performance of the other method. As
  performance measures one can choose all those that assign a single number to
  a test family, e.g. median false positive counts (see Evaluators above). The
  measures can be cut at the bottom and the top, if one wants to restrict the
  analysis to a certain interval.
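
  The curve behind the performance figure can be computed along the following
  lines (an illustrative sketch only, not the code in report.py), given the
  list of per-test-case values of one measure for one method:

      # percentage of test cases with the measure at or below each cut-off
      # (cf. the Max x and Step parameters below)
      def performance_curve(values, max_x, step):
          curve = []
          cutoff = 0
          while cutoff <= max_x:
              below = len([v for v in values if v <= cutoff])
              curve.append((cutoff, 100.0 * below / len(values)))
              cutoff += step
          return curve

      # example: performance_curve([0, 3, 12, 50], 10, 5)
      # returns  [(0, 25.0), (5, 50.0), (10, 50.0)]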

Report name

  From the given reporter and the report name, three output file names are
  derived. For example, choosing a performance figure with the report name
  "test" results in files "test.performance_figure" with the three suffixes
  ".dat", ".gnuplot", and ".ps". They are written to a directory "reports"
  which can be found in <workpath>/<scop>. For example, if the workpath is
  /homes/juser/data and the scop database is pdb90d_1.37, the report files can
  be found in /homes/juser/data/pdb90d_1.37/reports. The gnuplot file can be
  edited manually to adapt the appearance of the graphics (e.g. with respect to
  the method names) or to create other file formats (e.g. GIF instead of
  PostScript). Having changed the gnuplot file, just type "gnuplot <file>" to
  create a new output.
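
  Continuing the example above, regenerating the figure after editing the
  gnuplot file would simply be:

      gnuplot /homes/juser/data/pdb90d_1.37/reports/test.performance_figure.gnuplot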

Dictionary

  In the scop database, sequences are classified according to a SCOP code,
  from which Phase4 derives training and test sets. To produce tables with
  meaningful superfamily and family names, a dictionary has to be given that
  assigns a name to each superfamily and family SCOP code (e.g., in
  pdb90d_1.37, 7.1.1.1 is Insulin-like). For the scop versions shipped with the
  Phase4 distribution, the dictionaries can be found in the addons directory.

Common Subset

  Different constructors (see Data sheet above) might result in different test
  sets due to different construction constraints. By choosing the Common Subset
  option, only those test cases are included in the report that occur in all
  selected settings.

Select test de reg exp

  It can be interesting to restrict a report to certain subsets of possible
  test sets. These subsets can be selected by giving a regular expression that
  matches the description of the desired test sequences.

Max x

  This is the maximum value on the x-axis of the produced diagram. The default
  value is defined in the config file.

Step

  This is the step size on the x-axis of the produced diagram. The default
  value is defined in the config file.

Significance level

  The desired significance level for the Wilcoxon signed rank test (see above).

Bottom cut

  For the Wilcoxon signed rank test (see above), the performance measures to be
  analysed can be cut at the bottom, i.e. all values that are below the "bottom
  cut" value are assigned this value. If this field is left empty, measures are
  not restricted at the bottom.

Top cut
  
  For the Wilcoxon signed rank test (see above), the performance measures to be
  analysed can be cut at the top, i.e. all values that are above the "top
  cut" value are assigned this value. If this field is left empty, measures are
  not restricted at the top.


Customization
-------------

Adding Constructors

  1. Add a new constructor option in config.py
  2. Define a new construction function in constructors.py
  3. Augment the constructors dictionary in constructors.py

Adding Methods

  1. Add a new method option in config.py
  2. Define a new method class in methods.py
  3. Define a new output filter in filters.py if necessary
  4. Augment the methods dictionary in methods.py
  5. Augment the bestscore dictionary in methods.py

  Method Generators

  Method generators are used to define new methods automatically, especially
  by iterating over parameters. See Ssearch_generator in methods.py for an
  example.

Adding Evaluators

  1. Add a new evaluator option in config.py
  2. Define a new evaluator function in evaluators.py
  3. Augment the evaluators dictionary in evaluators.py
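
  A purely hypothetical sketch of steps 2 and 3; the signature an evaluator
  function really has to provide, and the exact form of the dictionary entry,
  are best taken from the existing functions in evaluators.py:

      # evaluators.py (sketch; function name, signature and dictionary key
      # are hypothetical)
      def my_measure(ranked):
          # return one number per test case, e.g. a count of negatives
          return len([is_pos for is_pos in ranked if not is_pos])

      evaluators["my_measure"] = my_measure   # step 3: augment the dictionary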

Adding Reporters

  1. Add a new reporter option in config.py
  2. Define a new reporter function in reporters.py
  3. Add a new reporter case at the end of report.py



Command-line parameters
-----------------------

All phases can be accomplished on the command line. The programs and their
command line parameters are as follows:

Constructors

  --distant_family_one_model
  --distant_family_one_model_nonredundant
  --distant_family_one_per_family
  --distant_family_one_per_family_nonredundant
  --distant_family_single_sequences
  --distant_family_single_sequences_nonredundant
  --distant_family_one_sequence

  --family_halves_one_model
  --family_halves_one_per_family
  --family_halves_single_sequences

  --family_half_one_model

  --superfamily_refind_one_model
  --family_refind_one_model


Methods

  --hmmsearch_local
  --hmmsearch_domain
  --hmmsearch_global

  --sam_local
  --sam_domain
  --sam_global

  --pfsearch_local
  --pfsearch_domain
  --pfsearch_global

  --ssearch_blosum50
  --ssearch_blosum62

  --jsearch

  --blastp_blosum50
  --blastp_blosum62
  --blastp2_blosum50
  --blastp2_blosum62


Method generators

  --jsearch_generator
  --ssearch_generator
  --blastp2_generator
  --pfsearch_generator
  --test_method_generator

Evaluators

  --eqnr
  --releqnr

  --minfpcount
  --medfpcount
  --maxfpcount
  --sumfpcount
  --minfprate
  --medfprate
  --maxfprate

  --covfpcount

  --tpcount0
  --tpcount50

  --roc50

  --copylist

Reporters

  --ascii_table
  --latex_table
  --performance_figure
  --unique_latex_table
  --unique_figure
  --cov_fp_plot
  --cov_logfp_plot
  --correlation_plot
  --copylist_plot
  --avrg_one_dim_plot
  --avrg_two_dim_plot
  --signed_rank_test


Construction: construct.py

  constructors, methods, method generators

  --workpath <path>
  --scop <scop fasta file>

Execution: run.py

  constructors, methods, method generators

  --workpath <path>
  --scop <scop fasta file>
  --targetdb <fasta file>
  --local
  --lsf
  --sge

Evaluation: evaluate.py

  constructors, methods, method generators, evaluators

  --workpath <path>
  --scop <scop fasta file>
  --targetdb <fasta file>
  --best
  --avrg
  --by_score
  --by_evalue

Reporting: report.py

  constructors, methods, method generators, evaluators, reporters

  --common_subset
  --select_test_de_reg_exp <regular expression>
  --max_x <number>
  --step <number>
  --dictionary <scop name dictionary>
  --significance_level <number>
  --bottom_cut <number>
  --top_cut <number>
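
As an illustration, a complete command-line pass through all four phases might
look as follows; paths and file names are placeholders, and the options are
combined as documented in the lists above:

  construct.py --workpath /homes/juser/data --scop pdb90d_1.37.fa \
               --distant_family_one_model --hmmsearch_local

  run.py       --workpath /homes/juser/data --scop pdb90d_1.37.fa \
               --distant_family_one_model --hmmsearch_local --local

  evaluate.py  --workpath /homes/juser/data --scop pdb90d_1.37.fa \
               --distant_family_one_model --hmmsearch_local \
               --by_score --minfpcount

  report.py    --distant_family_one_model --hmmsearch_local \
               --minfpcount --ascii_table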
