Bauhaus
A Quantitative Evaluation Framework for Clustering Techniques
for Re-Modularization and Component Recovery
Information
on the Bauhaus project
Overview
Background Information and
Goals
Component recovery supports program understanding, architecture recovery,
and re-use and is a prerequisite for re-modularization. Recovering static
components, i.e., modules and subsystems, has proven useful in a number
of ways and many methods to automatically or semi-automatically detect
static components have been published in the literature.
However, their relative strengths and weaknesses are unknown to a great
extent.
Experimental quantitative evaluation of clustering techniques for component
recovery is necessary in order to analyze their strengths and weaknesses
in comparison to other techniques. However, there has not been any accepted
evaluation scheme so far, which would allow to compare different techniques,
since Lakhotia and Gravely have pointed
out this grievance in 1995. The goal of our initiative is to establish
such an evaluation scheme. The basic idea is to compare components as detected
by automatic techniques, so-called candidates, to components manually
identified by human software engineers or programmers, so-called references
and then to measure precision and recall of the set of proposed candidates
with respect to the set of references. To this end, we need:
-
a common reference corpus of freely available systems, for which the actual
components are known. The reference corpus is used to measure recall and
precision of automatic techniques.
-
a standard scheme for measureing the correspondence of the components recovered
by a clustering technique to components in the reference corpus
In our contribution "A Framework for Experimental
Evaluation of Clustering Techniques" to the International Workshop
on Program Comprehension, 2000, in Limerick, we described a process of
setting up reference corpora and ways of measuring recall and precision
of automatic clustering techniques in details (slides
of the presentation are also available). Here, we will make available
our framework for quantitative evaluation of fully automatic clustering
techniques for software modules. The framework consists of a specification
of a quantitative evaluation method, a tool that supports this method,
and manually compiled sets of software modules of diverse systems that
are used to benchmark the automatic techniques.
Limerick
In the spirit of Limerick, where our framework was first presented:
She comes up to Limerick with a new way to cluster,
but can we really trust'er?
If she picks up this theme,
if she uses our scheme,
she will be our master.
Quantitative Evaluation Scheme
The quantiative evaluation scheme is described in our paper "A
Framework for Experimental Evaluation of Clustering Techniques", Proceedings
of the International Workshop on Program Comprehension, 2000, IEEE Computer
Society Press.
Freely available tool support
We offer our tool that implements the evaluation, given a set of references
and a set of candidates. It can be freely downloaded. However, we expect
feedback from users (and we don't give any warranties whatsoever of course).
Executables are available for the following platforms (N.B.: Use the "Save Link as" facility of your browser for the following two links; these files are executables):
The tool is called with two arguments: the file with the candidates (using
-cand) and the file with the references (using -ref), e.g.,
detection_quality -cand candidates.re -ref references.re
An option -detailed may be used to print the pairs of corresponding
components. Use
detection_quality -help
for a full description of all options.
The input in both files must be according to the following grammar rules
(in EBNF; comments start with an % and end with the next new line):
File
::= <empty file> | Components .
Components
::= Component { ',' Component }.
Component
::= '{' [Component_Name] [Entities] '}' .
Entities
::= Entity { ',' Entity } .
Entity
::= '<' System_Name ',' File_Name ',' Object_Name '>'
.
Component_Name ::= '%NAME' STRING .
System_Name ::= STRING .
File_Name
::= STRING .
Object_Name ::= STRING
.
STRING
::= '"' <any character except " and newline > '"' .
Here is an example with two components:
{%NAME "ConceptArray"
<"concepts", "concept.h", "ConceptArray">,
% a type
<"concepts", "concept.c", "ConceptSubconcepts">,
% a subprogram
<"concepts", "concept.c", "ConceptLattice">,
% another type
<"concepts", "concept.c", "ConceptArrayDelete">,
<"concepts", "concept.c", "ConceptArrayInit">
},
{%NAME "Concept"
<"concepts", "concept.h", "Concept">,
<"concepts", "concept.c", "ConceptPrint">,
<"concepts", "concept.c", "ConceptDelete">,
<"concepts", "concept.c", "ConceptInit">
}
Two larger examples are:
Benchmark Systems
Altogether, we have four benchmark systems, for which reference components
were manually obtained by software engineers:
Aero
-
Version 1.7, 31 (commented) Kloc of C code, 57 user-defined types,
480 global variables and constants, 488 subprograms. Aero is an X-window-based
simulator for rigid body systems.
-
-
Bash
-
Version 1.14.4, 38 (commented) Kloc of C code, 60 user-defined types,
487 global variables and constants, 1002 subprograms. Bash is a Unix shell.
-
-
CVS
-
Version 1.8, 30 (commented) Kloc of C code, 41 user-defined types,
386 global variables and constants, 575 subprograms. CVS is a system
for controlling concurrent software development.
-
-
Mosaic
-
Version 2.6 (without GUI), 37 (commented) Kloc of C code,
79 user-defined types, 269 global variables and constants, 564 subprograms.
-
A more detailed description on how the reference components for the systems
were obtained along with a profile of the human analysts can be found here.
These reference components are freely available, but we do expect feedback.
Please contact Rainer
Koschke.
Contact
For further information and suggestions, please contact Rainer
Koschke.
Some References
to Component Recovery Techniques
-
Belady, L.A., Evangelisti, C.J., `System Partitioning and its Measure',
Journal of Systems and Software, 2(1), pp. 23-29, February 1982.
-
Canfora, G., Cimitile, A., Munro, M., `An Improved Algorithm for Identifying
Objects in Code', Journal of Software Practice and Experience, 26(1), pp.
25-48, January 1996.
-
Choi, S.C., Scacchi, W., `Extracting and Restructuring the Design of Large
Systems', IEEE Software, 7(1), pp. 66-71, January 1990.
-
Doval, D., Mancoridis, S., Mitchel, B.S, Chen, Y., Gansner, E.R., 'Automatic
Clustering of Software Systems using a Genetic Algorithm', Proceedings
of the International Conference on Software Tools and Engineering Practice,
August 1999.
-
Gall, H., Klösch, R., `Finding Objects in Procedural Programs: An
Alternative Approach', Proceedings of the Second Working Conference on
Reverse Engineering, pp. 208-216, Toronto, July 1995.
-
Girard, J.F., Koschke, R., Schied, G., `A Metric-based Approach to Detect
Abstract Data Types and Abstract State Encapsulation', Journal on Automated
Software Engineering, 6, pp. 357-386, Kluwer 1999.
-
Girard, J.F., Koschke, R., `A Comparison of Abstract Data Type and Objects
Recovery Techniques', Journal Science of Computer Programming, Elsevier,
to appear.
-
Girard, J.F., Koschke, R., `Finding Components in a Hierarchy of Modules:
a Step Towards Architectural Understanding', International Conference on
Software Maintenance, pp. 58-65, Bari, October 1997.
-
Hutchens, D.H., Basili, V.R., `System Structure Analysis: Clustering with
Data Bindings', IEEE Transactions on Software Engineering, SE-11(8), pp.
749-757, August 1985.
-
Kazman, R., Carrière, S.J., `Playing detective: reconstructing software
architecture from available evidence', Technical Report CMU/SEI-97-TR-010,
ESC-TR-97-010, Software Engineering Institute, Pittsburgh, USA, 1997.
-
Koschke, R., `An Semi-Automatic Method for Component Recovery', Proceedings
of the Sixth Working Conference on Reverse Engineering, pp.256-267, Atlanta,
October 1999.
-
Koschke, R. (2000), `Atomic Architectural Component Detection for Program
Understanding and System Evolution', Ph.D. thesis. University of Stuttgart,
to appear.
-
Lakhotia, A., `A Unified Framework for Expressing Software Subsystems Classification
Techniques', Journal Systems Software, Elsevier Science Publisher, 36,
pp. 211-231, 1997
-
Lakhotia, A., Gravely, J.M., `Toward
Experimental Evaluation of Subsystem Classification Recovery Techniques',
Proceedings of the Second Working Conference on Reverse Engineering, pp.
262-269, Toronto, July 1995.
-
Lindig, C., Snelting, G., `Assessing Modular Structure of Legacy Code Based
on Mathematical Concept Analysis', Proceedings of the Nineteenth International
Conference on Software Engineering, Boston, 1997.
-
Liu, S.S., Wilde, N., `Identifying Objects in a Conventional Procedural
Language: An Example of Data Design Recovery', Proceedings of the International
Conference on Software Maintenance, pp. 266-271, November 1990.
-
Müller, H., Wong, K., Tilley, S., `A Reverse Engineering Environment
Based on Spatial and Visual Software Interconnection Models', Proceedings
of the Fifth ACM SIGSOFT Symposium on Software Development Environments,
pp 88-98, Tyson's Corner, December 1992.
-
Müller, H.A., Orgun, M.A., Tilley, S.R., Uhl, J.S., `A Reverse Engineering
Approach to Subsystem Structure Identification'. Journal of Software Maintenance:
Research and Practice, 5(4), pp. 181-204, December 1993.
-
Ogando, R.M., Yau, S.S., Wilde, N., `An Object Finder for Program Structure
Understanding in Software Maintenance', Journal of Software Maintenance,
6(5), pp. 261-83, September-October 1994.
-
Patel, S., Chu, W., Baxter, R., `A Measure for Composite Module Cohesion',
Proceedings of the Fourteenth International Conference on Software Engineering,
pp. 38-48, Melbourne, May 1992.
-
Sahraoui, H., Melo, W, Lounis, H., Dumont, F., `Applying Concept Formation
Methods to Object Identfication in Procedural Code', Proceedings of the
Twelfth Conference on Automated Software Engineering, pp. 210-218, Nevada,
November 1997.
-
Schwanke, R. W., `An Intelligent Tool for Re-engineering Software Modularity',
Proceedings of the International Conference on Software Engineering, pp.
83-92, May 1991.
-
Siff, M., Reps, T., `Identifying Modules via Concept Analysis', Proceedings
of the International Conference on Software Maintenance, pp. 170-179, Bari,
October 1997.
-
Tzerpos, V., Holt, R., `MoJo: A Distance Metric for Software Clusterings',
Proceedings of the Sixth Working Conference on Reverse Engineering, pp.
187-193, Atlanta, July 1999.
-
Valasareddi, R.R., Carver, D.L., `A Graph-based Object Identification Process
for Procedural Programs', Proceedings of the Fifth Working Conference on
Reverse Engineering, pp. 50-58, Honolulu, October 1998.
-
Yeh, A.S., Harris, D., Reubenstein, H., `Recovering Abstract Data Types
and Object Instances From a Conventional Procedural Language', Proceedings
of the Second Working Conference on Reverse Engineering, pp. 227-236, July
1995.
koschke@informatik.uni-stuttgart.de
Last modified: Wed Sep 6 12:08:29 MET DST
2000