A Comparison of Bug Finding Tools for Java
Nick Rutar
Christian B. Almazan
Jeffrey S. Foster
University of Maryland, College Park
{rutar, almazan, jfoster}@cs.umd.edu
∗This research was supported in part by NSF CCF-0346982.

Abstract

Bugs in software are costly and difficult to find and fix. In recent years, many tools and techniques have been developed for automatically finding bugs by analyzing source code or intermediate code statically (at compile time). Different tools and techniques have different tradeoffs, but the practical impact of these tradeoffs is not well understood. In this paper, we apply five bug finding tools, specifically Bandera, ESC/Java 2, FindBugs, JLint, and PMD, to a variety of Java programs. By using a variety of tools, we are able to cross-check their bug reports and warnings. Our experimental results show that none of the tools strictly subsumes another, and indeed the tools often find non-overlapping bugs. We discuss the techniques each of the tools is based on, and we suggest how particular techniques affect the output of the tools. Finally, we propose a meta-tool that combines the output of the tools together, looking for particular lines of code, methods, and classes that many tools warn about.

1 Introduction

In recent years, many tools have been developed for automatically finding bugs in program source code, using techniques such as syntactic pattern matching, data flow analysis, type systems, model checking, and theorem proving. Many of these tools check for the same kinds of programming mistakes, yet to date there has been little direct comparison between them. In this paper, we perform one of the first broad comparisons of several Java bug-finding tools over a wide variety of tasks.

In the course of our experiments, we discovered, somewhat surprisingly, that there is clearly no single “best” bug-finding tool. Indeed, we found a wide range in the kinds of bugs found by different tools (Section 2). Even in the cases when different tools purport to find the same kind of bug, we found that in fact they often report different instances of the bug in different places (Section 4.1). We also found that many tools produce a large volume of warnings, which makes it hard to know which to look at first.

Even though the tools do not show much overlap in particular warnings, we initially thought that they might be correlated overall. For example, if one tool issues many warnings for a class, then it might be likely that another tool does as well. However, our results show that this is not true in general. There is no correlation of warning counts between pairs of tools. Additionally, and perhaps surprisingly, warning counts are not strongly correlated with lines of code.

Given these results, we believe there will always be a need for many different bug finding tools, and we propose creating a bug finding meta-tool for automatically combining and correlating their output (Section 3). Using this tool, developers can look for code that yields an unusual number of warnings from many different tools. We explored two different metrics for using warning counts to rank code as suspicious, and we discovered that both are correlated for the highest-ranked code (Section 4.2).

For our study, we selected five well-known, publicly available bug-finding tools (Section 2.2). Our study focuses on PMD [18], FindBugs [13], and JLint [16], which use syntactic bug pattern detection. JLint and FindBugs also include a dataflow component. Our study also includes ESC/Java [10], which uses theorem proving, and Bandera [6], which uses model checking.

We ran the tools on a small suite of variously-sized Java programs from various domains. It is a basic undecidability result that no bug finding tool can always report correct results. Thus all of the tools must balance finding true bugs with generating false positives (warnings about correct code) and false negatives (failing to warn about incorrect code). All of the tools make different tradeoffs, and these choices are what cause the tools to produce the wide range of results we observed for our benchmark suite.

The main contributions of this paper are as follows:

• We present what we believe is the first detailed comparison of several different bug finding tools for Java over a variety of checking tasks.
• We show that, even for the same checking task, there is little overlap in the warnings generated by the tools. We believe this occurs because all of the tools choose different tradeoffs between generating false positives and false negatives.
• We also show that the warning counts from different tools are not generally correlated. Given this result, we believe that there will always be a need for multiple separate tools, and we propose a bug finding meta-tool for combining the results of different tools together and cross-referencing their output to prioritize warnings. We show that two different metrics tend to rank code similarly.

1.1 Threats to Validity

There are a number of potential threats to the validity of this study. Foremost is simply the limited scope of the study, both in terms of the test suite size and in terms of the selection of tools. We believe, however, that we have chosen a representative set of Java benchmarks and Java bug finding tools. Additionally, there may be other considerations for tools for languages such as C and C++, which we have not studied. However, since many tools for those languages use the same basic techniques as the tools we studied, we think that the lessons we learned will be applicable to tools for those languages as well.

Another potential threat to validity is that we did not exactly categorize every false positive and false negative from the tools. Doing so would be extremely difficult, given the large number of warnings from the tools and the fact that we ourselves did not write the benchmark programs in our study. Instead, in Section 4.1, we cross-check the results of the tools with each other in order to get a general sense of how accurate the warnings are, and in order to understand how the implementation techniques affect the generated warnings. We leave it as interesting future work to check for false negatives elsewhere, e.g., in CVS revision histories or change logs.

A final threat to validity is that we make no distinction between the severity of one bug versus another. Quantifying the severity of bugs is a difficult problem, and it is not the focus of this paper. For example, consider the following piece of code:

    int x = 2, y = 3;
    if (x == y)
      if (y == 3)
        x = 3;
    else
      x = 4;

In this example, indentation would suggest that the else corresponds to the first if, but the language grammar says otherwise. The result is most likely a logical error, since a programmer might believe this code will result in x=4 when it really results in x=2. Depending on later uses of x, this could be a major error. Used with the right rulesets for ensuring that all if statements use braces around the body, PMD will flag this program as suspicious.

The following more blatant error is detected by JLint, FindBugs, and ESC/Java:

    String s = new String("I’m not null...yet");
    s = null;
    System.out.println(s.length());

This segment of code will obviously cause an exception at runtime, which is not desirable, but will have the effect of halting the program as soon as the error occurs (assuming the exception is not caught). Moreover, if it is on a common program path, this error will most likely be discovered when the program is run, and the exception will pinpoint the exact location of the error.

When asked which is the more severe bug, many programmers might say that a null dereference is worse than not using braces in an if statement (which is often not an error at all). And yet the logical error caused by the lack of braces might perhaps be much more severe, and harder to track down, than the null dereference.

These small examples illustrate that for any particular program bug, the severity of the error cannot be separated from the context in which the program is used. With this in mind, in Section 6 we mention a few ways in which user-specified information about severity might be taken into account.

2 Background

2.1 A Small Example

The code sample in Figure 1 illustrates the variety and typical overlap of bugs found by the tools. It also illustrates the problems associated with false positives and false negatives. The code in Figure 1 compiles with no errors and no warnings, and though it won’t win any awards for functionality, it could easily be passed off as fine. However, four of the five tools were each able to find at least one bug in this program. (Bandera wasn’t tested against the code for reasons explained later.)

PMD discovers that the variable y on line 8 is never used and generates an “Avoid unused local variables” warning. FindBugs displays a “Method ignores results of InputStream.read()” warning for line 12; this is an error because the result of InputStream.read() is the number of bytes read, and this may be fewer bytes than the programmer is expecting. FindBugs also displays a “Method
may fail to close stream on exception” warning and a warning on lines 17-18 for using “==” to compare String objects (which is incorrect). ESC/Java displays “Warning: Array index possibly too large” because the comparison on lines 17-18 may access an element outside the bounds of the array due to an error in the loop guard on line 16. This, as we can see, is a valid error. However, ESC/Java also displays a warning for “Possible null dereference” on line 18, which is a false positive since b is initialized in the constructor. Finally, JLint displays a “Compare strings as object references” warning for the string comparison on lines 17-18. This is the same error that FindBugs detected, which illustrates that there is some overlap between the tools.

     1  import java.io.*;
     2  public class Foo{
     3    private byte[] b;
     4    private int length;
     5    Foo(){ length = 40;
     6      b = new byte[length]; }
     7    public void bar(){
     8      int y;
     9      try {
    10        FileInputStream x =
    11          new FileInputStream("z");
    12        x.read(b,0,length);
    13        x.close();}
    14      catch(Exception e){
    15        System.out.println("Oopsie");}
    16      for(int i = 1; i <= length; i++){
    17        if (Integer.toString(50) ==
    18            Byte.toString(b[i]))
    19          System.out.print(b[i] + " ");
    20      }
    21    }
    22  }

Figure 1. A Sample Java Class

This is the sum total of all the errors reported by the tools. The results for this small example also illustrate several cases of false negatives. FindBugs, for instance, also looks for unused variables, but does not discover that the variable y on line 8 was never used. As another example, JLint sometimes warns about indexing out of bounds, but JLint does not recognize the particular case on line 16. Further examples of overlapping warnings between programs, false positives, and false negatives are all discussed later on in the paper.

2.2 Java Bug Finding Tools

Figure 2 contains a brief summary of the five tools we study in this paper, and below we discuss each of them in more detail.

    Name      Version        Input     Interfaces         Technology
    Bandera   0.3b2 (2003)   Source    CL, GUI            Model checking
    ESC/Java  2.0a7 (2004)   Source¹   CL, GUI            Theorem proving
    FindBugs  0.8.2 (2004)   Bytecode  CL, GUI, IDE, Ant  Syntax, dataflow
    JLint     3.0 (2004)     Bytecode  CL                 Syntax, dataflow
    PMD       1.9 (2004)     Source    CL, GUI, Ant, IDE  Syntax

    CL - Command Line
    ¹ ESC/Java works primarily with source but may require bytecode or specification files for supporting types.

Figure 2. Bug Finding Tools and Their Basic Properties

FindBugs [13] is a bug pattern detector for Java. FindBugs uses a series of ad-hoc techniques designed to balance precision, efficiency, and usability. One of the main techniques FindBugs uses is to syntactically match source code to known suspicious programming practice, in a manner similar to ASTLog [7]. For example, FindBugs checks that calls to wait(), used in multi-threaded Java programs, are always within a loop, which is the correct usage in most cases. In some cases, FindBugs also uses dataflow analysis to check for bugs. For example, FindBugs uses a simple, intraprocedural (within one method) dataflow analysis to check for null pointer dereferences.

FindBugs can be expanded by writing custom bug detectors in Java. We set FindBugs to report “medium” priority warnings, which is the recommended setting.

JLint [1, 16], like FindBugs, analyzes Java bytecode, performing syntactic checks and dataflow analysis. JLint also includes an interprocedural, inter-file component to find deadlocks by building a lock graph and ensuring that there are never any cycles in the graph. JLint 3.0, the version we used, includes the multi-threaded program checking extensions described by Artho [1]. JLint is not easily expandable.

PMD [18], like FindBugs and JLint, performs syntactic checks on program source code, but it does not have a dataflow component. In addition to some detection of clearly erroneous code, many of the “bugs” PMD looks for are stylistic conventions whose violation might be suspicious under some circumstances. For example, having a try statement with an empty catch block might indicate that the caught error is incorrectly discarded. Because PMD includes many detectors for bugs that depend
on programming style, PMD includes support for selecting which detectors or groups of detectors should be run. In our experiments, we run PMD with the rulesets recommended by the documentation: unusedcode.xml, basic.xml, import.xml, and favorites.xml. The number of warnings can increase or decrease depending on which rulesets are used. PMD is easily extensible by programmers, who can write new bug pattern detectors using either Java or XPath.

Bandera [6] is a verification tool based on model checking and abstraction. To use Bandera, the programmer annotates their source code with specifications describing what should be checked, or no specifications if the programmer only wants to verify some standard synchronization properties. In particular, with no annotations Bandera verifies the absence of deadlocks. Bandera includes optional slicing and abstraction phases, followed by model checking. Bandera can use a variety of model checkers, including SPIN [12] and the Java PathFinder [11].

We included Bandera in our study because it uses a completely different technique than the other tools we looked at. Unfortunately, Bandera version 0.3b2 does not run on any realistic Java programs, including our benchmark suite. The developers of Bandera acknowledge on their web page that it cannot analyze Java (standard) library calls, and unfortunately the Java library is used extensively by all of our benchmarks. This greatly limits the usability and applicability of Bandera (future successors will address this problem). We were able to successfully run Bandera and the other tools on the small example programs supplied with Bandera. Section 5 discusses the results.

ESC/Java [10], the Extended Static Checking system for Java, based on theorem proving, performs formal verification of properties of Java source code. To use ESC/Java, the programmer adds preconditions, postconditions, and loop invariants to source code in the form of special comments. ESC/Java uses a theorem prover to verify that the program matches the specifications.

ESC/Java is designed so that it can produce some useful output even without any specifications, and this is the way we used it in our study. In this case, ESC/Java looks for errors such as null pointer dereferences, array out-of-bounds errors, and so on; annotations can be used to remove false positives or to add additional specifications to be checked.

For our study, we used ESC/Java 2 [5], a successor to the original ESC/Java project. ESC/Java 2 includes support for Java 1.4, which is critical to analyzing current applications. ESC/Java 2 is being actively developed, and all references to ESC/Java will refer to ESC/Java 2, rather than the original ESC/Java.

We included ESC/Java in our set of tools because its approach to finding bugs is notably different from the other tools. However, as we will discuss in Section 3, without annotations ESC/Java produces a multitude of warnings. Houdini [9] can automatically add ESC/Java annotations to programs, but it does not work with ESC/Java 2 [4]. Daikon [8] can also be used as an annotation assistant to ESC/Java, but doing so would require selecting representative dynamic program executions that sufficiently cover the program paths, which we did not attempt. Since ESC/Java really works best with annotations, in this paper we will mostly use it as a point of comparison and do not include it in the meta-tool metrics in Section 4.2.

2.3 Taxonomy of Bugs

We classified all of the bugs the tools find into the groups listed in Figure 3. The first column lists a general class of bugs, and the second column gives one common example from that class. The last columns indicate whether each tool finds bugs in that category, and whether the tools find the specific example we list. We did not put Bandera in this table, since without annotations its checks are limited to synchronization properties.

    Bug Category                   Example                                  Number of * marks
    General                        Null dereference                         3
    Concurrency                    Possible deadlock                        2
    Exceptions                     Possible unexpected exception            1
    Array                          Length may be less than zero             1
    Mathematics                    Division by zero                         1
    Conditional, loop              Unreachable code due to constant guard   1
    String                         Checking equality using == or !=         1
    Object overriding              Equal objects must have equal hashcodes  3
    I/O stream                     Stream not closed on all paths           1
    Unused or duplicate statement  Unused local variable                    1
    Design                         Should be a static inner class           1
    Unnecessary statement          Unnecessary return statement             1

    Tool columns: ESC/Java, FindBugs, JLint, PMD
    √ - tool checks for bugs in this category
    * - tool checks for this specific example
    [The √ marks and the tool column of each * mark are not recoverable in this copy.]

Figure 3. The Types of Bugs Each Tool Finds

These classifications are our own, not the ones used in the literature for any of these tools. With this in mind, notice that the largest overlap is between FindBugs and PMD, which share 6 categories in common. The “General” category is a catch-all for checks that do not fit in the other categories, so all tools find something in that category. All of the tools also look for concurrency errors. Overall, there are many common categories among the tools and many categories on which the tools differ.

Other fault classifications that have been developed are not appropriate for our discussion. Two such classifications, the Orthogonal Defect Classification [3] and the IEEE Standard Classification for Software Anomalies [14], focus on the overall software life cycle phases. Both treat faults at a much higher level than we do in this paper. For example, they have a facility for specifying that a fault is a logic problem, but do not provide specifications for what the logic problem leads to or was caused by, such as incorrect synchronization.

3 Experiments

To generate the results in this paper, we wrote a series of scripts that combine and coordinate the output from the various tools. Together, these scripts form a preliminary version of the bug finding meta-tool that we mentioned in the introduction. This meta-tool allows a developer to examine the output from all the tools in a common format and find what classes, methods, and lines generate warnings.

As discussed in the introduction, we believe that such a meta-tool can provide much better bug finding ability than
the tools in isolation. As Figure 3 shows, there is a lot of variation even in the kinds of bugs found by the tools. Moreover, as we will discuss in Section 4.1, there are not many cases where multiple tools warn about the same potential problem. Having a meta-tool means that a developer need not rely on the output of a single tool. In particular, the meta-tool can rank classes, methods, and lines by the number of warnings generated by the various tools. In Section 4.2, we will discuss simple metrics for doing so and examine the results.

Of course, rather than having a meta-tool, perhaps the ideal situation would be a single tool with many different analyses built-in, and the different analyses could be combined and correlated in the appropriate fashion. However, as a practical matter, the tools tend to be written by a wide variety of developers, and so at least for now having a separate tool to combine their results seems necessary.

The preliminary meta-tool we built for this paper is fairly simple. Its main tasks are to parse the different textual output of the various tools (ranging from delimited text to XML) and map the warnings, which are typically given by file and line, back to classes and methods. We computed the rankings in a separate pass. Section 6 discusses some possible enhancements to our tool.

We selected as a testbed five mid-sized programs compiled with Java 1.4. The programs represent a range of applications, with varying functionality, program size, and program maturity. The five programs are

• Apache Tomcat 5.019: Java Servlet and JavaServer Pages implementation, specifically catalina.jar (http://jakarta.apache.org/tomcat)
• JBoss 3.2.3: J2EE application server (http://www.jboss.org)
• Art of Illusion 1.7: 3D modeling and rendering studio (http://www.artofillusion.org)
• Azureus 2.0.7: Java Bit Torrent client (http://azureus.sourceforge.net)
• Megamek 0.29: Online version of BattleTech game (http://megamek.sourceforge.net)

Figure 4 lists the size of each benchmark in terms of both Non Commented Source Statements (NCSS), roughly the number of ’;’ and ’{’ characters in the program, and the number of class files. The remaining columns of Figure 4 list the running times and total number of warnings generated by each tool. Section 4 discusses the results in depth; here we give some high-level comments. Bandera is not included in this table, since it does not run on any of these examples. See Section 5.

                         NCSS     Class  Time (min:sec.csec)                        Warning Count
    Name                 (Lines)  Files  ESC/Java   FindBugs  JLint     PMD         ESC/Java  FindBugs  JLint  PMD
    Azureus 2.0.7        35,549   1053   211:09.00  01:26.14  00:06.87  19:39.00    5474      360       1584   1371
    Art of Illusion 1.7  55,249   676    361:56.00  02:55.02  00:06.32  20:03.00    12813     481       1637   1992
    Tomcat 5.019         34,425   290    90:25.00   01:03.62  00:08.71  14:28.00    1241      245       3247   1236
    JBoss 3.2.3          8,354    274    84:01.00   00:17.56  00:03.12  09:11.00    1539      79        317    153
    Megamek 0.29         37,255   270    23:39.00   02:27.21  00:06.25  11:12.00    6402      223       4353   536

Figure 4. Running Time and Warnings Generated by Each Tool

To compute the running times, we ran all of the programs from the command line, as the optional GUIs can potentially reduce performance. Execution times were computed with one run, as performance is not the emphasis of this study. The tests were performed on a Mac OS X v10.3.3 system with a 1.25 GHz PowerPC G4 processor and 512 MB RAM. Because PMD accepts only one source file at a time, we used a script to invoke it on every file in each benchmark. Unfortunately, since PMD is written in Java, each invocation launches the Java virtual machine separately, which significantly reduces PMD’s performance. We expect that without this overhead, PMD would be approximately 20% faster. Recall that we used ESC/Java without annotations; we do not know if adding annotations would affect ESC/Java’s running time, but we suspect it will still run significantly slower than the other tools. Speaking in general terms, ESC/Java takes a few hours to run, FindBugs and PMD take a few minutes, and JLint takes a few seconds.

For each tool, we report the absolute number of warnings generated, with no normalization or attempt to discount repeated warnings about the same error. Thus we are measuring the total volume of information presented to a developer from each tool. For ESC/Java, the number of generated warnings is sometimes extremely high. Among the other tools, JLint tends to report the largest number of warnings, followed by PMD (though for Art of Illusion, PMD reported more warnings than JLint). FindBugs generally reports fewer warnings than the other tools. In general, we found this makes FindBugs easier to use, because there are fewer results to examine.

Figure 5 shows a histogram of the warning counts per class. (We do not include classes with no warnings.) Clearly, in most cases, when the tools find potential bugs, they only find a few, and the number of classes with multiple warnings drops off rapidly. For PMD and JLint, there are quite a few classes that have 19 or more warnings, while these are rare for FindBugs. For ESC/Java, many classes have 19 or more warnings.

Figure 5. Histogram for number of warnings found per class

4 Analysis

4.1 Overlapping Bug Categories

Clearly the tools generate far too many warnings to review all of them manually. In this section, we examine the effectiveness of the tools on three checking tasks that several of the tools share in common: concurrency, null dereference, and array bounds errors. Even for the same task we found a wide variation in the warnings reported by different tools. Figure 6 contains a breakdown of the warning counts. Even after restricting ourselves to these three categories, there is still a large number of warnings, and so our manual examination is limited to several dozen warnings.

                              ESC/Java  FindBugs  JLint  PMD
    Concurrency Warnings      126       122       8883   0
    Null Dereferencing        9120      18        449    0
    Null Assignment           0         0         0      594
    Index out of Bounds       1810      0         264    0
    Prefer Zero Length Array  0         36        0      0

Figure 6. Warning Counts for the Categories Discussed in Section 4.1

Concurrency Errors. All of the tools check for at least one kind of concurrency error. ESC/Java includes support for automatically checking for race conditions and potential deadlocks. ESC/Java finds no race conditions, but it issues 126 deadlock warnings for our benchmark suite. After investigating a handful of these warnings, we found that some of them appear to be false positives. Further investigation is difficult, because ESC/Java reports synchronized blocks that are involved in potential deadlocks but not the sets of locks in each particular deadlock.

PMD includes checks for some common bug patterns, such as the well-known double-checked locking bug in Java [2]. However, PMD does not issue any such warnings for our benchmarks. In contrast, both FindBugs and JLint do report warnings. Like PMD, FindBugs also checks for uses of double-checked locking. Interestingly, despite PMD reporting no such cases, FindBugs finds a total of three uses of double-checked locking in the benchmark programs. Manual examination of the code shows that, indeed, those three uses are erroneous. PMD does not report this error because its checker is fooled by some other code mixed in with the bug pattern (such as try/catch blocks).

FindBugs also warns about the presence of other concurrency bug patterns, such as not putting a monitor wait() call in a while loop. Examining the results in detail, we discovered that the warnings FindBugs reports usually correctly indicate the presence of the bug pattern in the code.
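
The wait() pattern mentioned above can be illustrated with a small sketch. This is not code from the benchmarks; Mailbox and MailboxDemo are hypothetical names chosen for the example. The guard around each wait() is re-tested in a while loop because a thread can wake up without its condition holding (the Object.wait documentation notes that spurious wakeups are possible), so a single if test is the shape a pattern checker would flag.

```java
// Hypothetical one-slot mailbox; names are illustrative, not from the paper.
final class Mailbox {
    private Integer slot;   // null means the mailbox is empty

    synchronized void put(int value) throws InterruptedException {
        // Re-check the guard after every wakeup: being notified does not
        // guarantee that the slot is actually empty.
        while (slot != null) {
            wait();
        }
        slot = value;
        notifyAll();
    }

    synchronized int take() throws InterruptedException {
        // Writing "if (slot == null) wait();" here would be the bug pattern:
        // after a spurious or stale wakeup the method would continue with
        // an empty slot.
        while (slot == null) {
            wait();
        }
        int value = slot;
        slot = null;
        notifyAll();
        return value;
    }
}

final class MailboxDemo {
    // Passes the values 1, 2, 3 through the mailbox and returns their sum.
    static int runDemo() {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 3; i++) {
                    box.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < 3; i++) {
                sum += box.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

A purely syntactic check only sees whether wait() sits inside a loop in the same method, which is why it can both miss context and flag correct code.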

What is less clear is how many of the patterns detected cor-
Interestingly, FindBugs discovers a very small set of po-
respond to actual errors. For example, since FindBugs does
tential null dereferences compared to both ESC/Java and
not perform interprocedural analysis (it analyzes a single
JLint. This is because FindBugs uses several heuristics to
method at a time), if a method with a wait() is itself
avoid reporting null-pointer dereference warnings in certain
called in a loop, FindBugs will still report a warning (though
cases when its dataflow analysis loses precision.
this did not happen in our benchmarks). And, of course, not
PMD does not check for null pointer dereferences, but it
all uses of wait() outside of a loop are incorrect.
does warn about setting certain objects to null. We suspect
On our test suite, JLint generates many warnings about
this check is not useful for many common coding styles.
potential deadlocks. In some cases, JLint produces many
ESC/Java also checks for some other uses of null that vi-
warnings for the same underlying bug. For instance, JLint
olate implicit specifications, e.g., assigning null to a field
checks for deadlock by producing a lock graph and look-
assumed not to be null. In a few cases, we found that PMD
ing for cycles. In several cases in our experiments, JLint
and ESC/Java null warnings coincide with each other. For
iterates over the lock graph repeatedly, reporting the same
example, in several cases PMD reported an object being set
cycle many times. In some cases, the same cycle generated
to null, and just a few lines later ESC/Java issued a warning
several hundred warnings. These duplicates, which make
about assigning null to another object.
it difficult to use the output of JLint, could be eliminated
by reporting a cycle in the lock graph just once. The sheer
Array Bounds Errors
In Java, indexing outside the
quantity of output from JLint makes it difficult to judge the
bounds of an array results is a run-time exception. While
rate of false positives for our benchmark suite. In Section 5
a bounds error in Java may not be the catastrophic error
we compare finding deadlocks using JLint and Bandera on
that it can be for C and C++ (where bounds errors over-
smaller programs.
write unexpected parts of memory), they still indicate a bug
in the program. Two of the tools we examined, JLint and
Null Dereferences
Among the four tools, ESC/Java,
ESC/Java, include checks for array bounds errors—either
FindBugs, and JLint check for null dereferences. Surpris-
creating an array with a negative size, or accessing an array
ingly, there is not a lot of overlap between the warnings
with an index that is negative or greater than the size of the
reported by the various tools.
array.
JLint finds many potential null dereferences. In order to reduce the number of warnings, JLint tries to only identify inconsistent assumptions about null. For example, JLint warns if an object is sometimes compared against null before it is dereferenced and sometimes not. However, we have found that in a fair number of cases, JLint's null dereference warnings are false positives. A common example is when conditional tests imply that an object cannot be null (e.g., because it was not null previously when the condition held). In this case, JLint often does not track enough information about conditionals to suppress the warning. Finally, in some cases there are warnings about null pointer dereferences that cannot happen because of deeper program logic; not many static analyses could handle these cases. Currently, there is no way to stop these warnings from being reported (sometimes multiple times).

ESC/Java reports the most null pointer dereferences because it often assumes objects might be null, since we did not add any annotations to the contrary. (Interestingly, ESC/Java does not always report null dereference warnings in the same places as JLint.) The net result is that, while potentially those places may be null pointer errors, there are too many warnings to be easily useful by themselves. Instead, to make the most effective use of these checks, it seems the programmer should provide annotations. For example, in method declarations parameters that are never null can be marked as such to avoid spurious warnings.

As with null dereference warnings, JLint and ESC/Java do not always report the same array bounds warnings in the same places. ESC/Java mainly reports warnings because parameters that are later used in array accesses may not be within range (annotations would help with this). JLint has several false positives and some false negatives in this category, apparently because it does not track certain information interprocedurally in its dataflow analysis. For example, code such as this appeared in our benchmarks:

    public class Foo {
        static Integer[] ary = new Integer[2];

        public static void assign() {
            Object o0 = ary[ary.length];    // always out of bounds, yet JLint is silent
            Object o1 = ary[ary.length-1];  // always in bounds, yet JLint warns
        }
    }

In this case, JLint signals a warning that the array index might be out of bounds for the access to o1 (because it thinks the length of the array might be 0), but clearly that is not possible here. On the other hand, there are no warnings for the access to o0, even though it will always be out of bounds no matter what size the array is.

FindBugs and PMD do not check for array bounds errors, though FindBugs does warn about returning null from
a method that returns an array (it may be better to use a 0-length array).

4.2 Cross-Tool Buggy Code Correlations

When initially hypothesizing about the relationship among the tools, we conjectured that warnings among the different tools were correlated, and that the meta-tool would show that more warnings from one tool would correspond to more warnings from other tools. However, we found that this is not necessarily the case. Figure 7 gives the correlation coefficients for the number of warnings found by pairs of tools per class. As these results indicate, the large number of warnings reported by some tools are sometimes simply anomalous, and there does not seem to be any general correlation between the total number of warnings one tool generates and the total number of warnings another tool generates for any given class.

    Tools                Correlation coefficient
    JLint vs PMD         0.15
    JLint vs FindBugs    0.33
    FindBugs vs PMD      0.31

Figure 7. Correlation among Warnings from Pairs of Tools

We also wanted to check whether the number of warnings reported is simply a function of the number of lines of code. Figure 8 gives correlation coefficients and scatter plots showing, for each Java source file (which may include several inner classes), the NCSS count versus the number of warnings. For JLint, we have removed from the chart five source files that had over 500 warnings each, since adding these makes it hard to see the other data points. As these plots show, there does not seem to be any general correlation between lines of code and number of warnings produced by any of the tools. JLint has the strongest correlation of the three, but it is still weak.

4.2.1 Two Simple Metrics for Isolating Buggy Code

Given that the tools' warnings are not generally correlated, we hypothesize that combining the results of multiple tools together can identify potentially troublesome areas in the code that might be missed when using the tools in isolation. Since we do not have exhaustive information about the severity of faults identified by the warnings and rates of false positives and false negatives, we cannot form any strong conclusions about the benefit of our metrics. Thus in this section we perform only a preliminary investigation.

We studied two metrics for ranking code. As mentioned in Section 2.2, we do not include ESC/Java in this discussion.

For the first metric, we started with the number of warnings per class file from each tool. (The same metric can also be used per method, per lexical scope, or per line.) For a particular benchmark and a particular tool, we linearly scaled the per-class warning counts to range between 0 and 1, with 1 being the maximum number of per-class warning counts reported by the tool over all our benchmarks.

Formally, let n be the total number of classes, and let X_i be the number of warnings reported by tool X for class number i, where i ∈ 1..n. Then we computed a normalized warning count

    X'_i = X_i / \max_{i=1}^{n} X_i

Then for class number i, we summed the normalized warning counts from each tool to compute our first metric, the normalized warning total:

    Total_i = FindBugs'_i + JLint'_i + PMD'_i

In order to avoid skewing the scaling for JLint, we reduced its warning count for the class with the highest number of errors from 1979 to 200, and for the next four highest classes to 199 through 196, respectively (to maintain their ranking).

With this first metric, the warning counts could be biased by repeated warnings about the same underlying bug. In order to compensate for this possibility, we developed a second metric, the unique warning total, that counts only the first instance of each type of warning message generated by a tool. For example, no matter how many null pointer dereferences FindBugs reports in a class, we only count this as 0 (if none were found) or 1 (if one or more were found). In this metric, we sum the number of unique warnings from all the tools.

4.2.2 Results

We applied these metrics to our benchmark suite, ranking the classes according to their normalized and unique warning totals. As it turns out, these two metrics are fairly well correlated, especially for the classes that are ranked highest by both metrics. Figure 9 shows the relationship between the normalized warning count and the number of unique warnings per class. The correlation coefficient for this relationship is 0.758. Of course, it is not surprising that these metrics are correlated, because they are clearly not independent (in particular, if one is non-zero then the other must be as well). However, a simple examination of certain classes shows that the high correlation coefficient between the two is not obvious. For instance, the class catalina.context has a warning count of 0 for FindBugs and JLint, but PMD generates 132 warnings. (As it turns out, PMD's warnings are
uninteresting). This class ranks 11th in normalized warning total, but 587th in unique warning total (all 132 warnings are the same kind). Thus just because a class generates a large number of warnings does not necessarily mean that it generates a large breadth of warnings.

Figure 8. Comparison of Number of Warnings versus NCSS

Figure 9. Normalized Warnings versus the Unique Warnings per Class

We manually examined the warnings for the top five classes for both metrics, listed in Figure 10. For these classes, Figure 10 shows the size of the class, in terms of NCSS and number of methods, the normalized warning total and rank, the total number of warnings found by each of the tools, and the number of unique warnings and rank. In this table, T-n denotes a class ranked n, which is tied with at least one other class in the ranking.

Recall that the goal of our metrics is to identify code that might be missed when using the tools in isolation. In this table, the top two classes in both metrics are the same, catalina.core.StandardContext and megamek.server.Server, and both also have the most warnings of any class from, respectively, FindBugs and JLint. Thus these classes, as well as artofillusion.object.TriangleMesh (with the most warnings from PMD), can be identified as highly suspicious by a single tool.

On the other hand, azureus2.ui.swt.MainWindow could be overlooked when considering only one tool at a time. It is ranked in the top 10 for both of our metrics, but it is 4th for FindBugs in isolation, 13th for JLint, and 30th for PMD. As another example, catalina.core.StandardWrapper (4th for the unique warning metric) is ranked 45th for FindBugs, 11th for JLint, and 349th for PMD; thus if we were only using a single tool, we would be unlikely to examine the warnings for it immediately.

In general, the normalized warning total measures the number of tools that find an unusually high number of warnings. The metric is still susceptible, however, to cases where a single tool produces a multitude of spurious warnings. For example, megamek.server.Server has hundreds of null dereference warnings from JLint, many likely false positives, which is why it is ranked second in this metric. In the case of artofillusion.object.TriangleMesh, 102 out of 140 of the warnings from PMD are for not using braces in a for statement, which is probably not a mistake at all.

On the other hand, the unique warning total measures the breadth of warnings found by the tools. This metric compensates for cascading warnings of the same kind, but it can be fooled by redundancy among the different tools. For example, if by luck a null dereference error is caught by two separate tools, then the warning for that error will be counted twice. This has a large effect on the unique warning counts, because they are in general small. An improved metric could solve this problem by counting uniqueness of errors across all tools (which requires identifying duplicate messages across tools, a non-obvious task for some warnings that are close but not identical).

We think that both metrics provide a useful gauge that allows programmers to go beyond finding individual bugs with individual tools. Instead, these metrics can be used to find code with an unusually high number and breadth of warnings from many tools, and our results show that both seem to be correlated for the highest-ranked classes.

5 Bandera

Bandera cannot analyze any of our benchmarks from Section 3, because it cannot analyze the Java library. In order to compare Bandera to the other tools, we used the small examples supplied with Bandera as a test suite, since we knew that Bandera could analyze them.

This test suite from Bandera includes 16 programs ranging from 100 to 300 lines, 8 of which contain a real deadlock. None of the programs include specifications; without specifications, Bandera will automatically check for deadlock.
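
To make concrete the kind of defect at issue in this section, the following minimal sketch shows a lock-ordering cycle between two threads. It is not drawn from the Bandera examples; the class and method names are hypothetical, and `tryLock()` is used as a non-blocking stand-in for a real acquisition so that the program reports the cycle instead of hanging.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; not taken from any benchmark in the paper.
final class LockOrderSketch {

    /**
     * Runs two threads that take locks a and b in opposite orders and returns
     * how many of them found their second lock already held by the other.
     */
    static int countFailedSecondAcquisitions() {
        ReentrantLock a = new ReentrantLock();
        ReentrantLock b = new ReentrantLock();
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger failed = new AtomicInteger();

        Thread t1 = new Thread(() -> acquireInOrder(a, b, barrier, failed));
        Thread t2 = new Thread(() -> acquireInOrder(b, a, barrier, failed));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failed.get();
    }

    private static void acquireInOrder(ReentrantLock first, ReentrantLock second,
                                       CyclicBarrier barrier, AtomicInteger failed) {
        first.lock();
        try {
            barrier.await();              // both threads now hold their first lock
            if (second.tryLock()) {       // non-blocking stand-in for second.lock()
                second.unlock();
            } else {
                failed.incrementAndGet(); // a blocking lock() here would wait forever
            }
            barrier.await();              // keep the first lock until both have tried
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads that would block: "
                + countFailedSecondAcquisitions());
    }
}
```

Because the barrier forces each thread to hold its first lock before either attempts the second, both attempts fail on every run. With blocking `lock()` calls in place of `tryLock()`, the same interleaving is a deadlock, which is the situation a deadlock checker has to find among all possible schedules.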

                                                   Total Warnings     Normalized        Unique Warnings
Name                               NCSS  Mthds     FB     JL   PMD   Total_i  Rank   FB   JL  PMD  Total  Rank
catalina.core.StandardContext      1863    255    *34    791    37      2.25     1    9   10    5     24     1
megamek.server.Server              4363    198      6  *1979    42      1.48     2    6   10    4     20     2
azureus2.ui.swt.MainWindow         1517     87     11     90    30      0.99     9    5    8    4     17     3
catalina.core.StandardWrapper       513     75     10     50     8      0.60    19    6    6    3     15     4
catalina.core.StandardHost          279     55      4     97     3      0.62    17   10    3    1     14     5
catalina.core.ContainerBase         518     70     14    849     3      1.42     3    3    7    3     13   T-8
artofillusion.object.TriangleMesh  2213     59      5     42  *140      1.36     4    3    7    3     13   T-8
megamek.common.Compute             2250    109      0   1076    23      1.16     5    0    7    3     10  T-22

* - Class with highest number of warnings from this tool

Figure 10. Classes Ranked Highly by Metrics
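
As a rough check on how the two metrics of Section 4.2.1 are computed, the sketch below recomputes the normalized warning total from the raw counts in Figure 10 and shows one way the unique warning total could be tallied. Several things here are assumptions rather than facts stated in the paper: the per-tool maxima are taken from the starred entries of Figure 10; the capped JLint values 199, 198, and 197 are assigned to the classes with 1076, 849, and 791 warnings only because those are the next-highest counts visible in the table; and the class name, method names, and warning-kind strings are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; a sketch under the assumptions stated above.
final class WarningMetrics {
    // Per-tool maxima, read off the starred entries of Figure 10.
    static final double MAX_FINDBUGS = 34.0;
    static final double MAX_JLINT = 200.0; // 1979 after the cap from Section 4.2.1
    static final double MAX_PMD = 140.0;

    // ASSUMPTION: which class receives which capped value is inferred from
    // the counts shown in Figure 10, not stated in the text.
    static int capJlint(int raw) {
        switch (raw) {
            case 1979: return 200;
            case 1076: return 199;
            case 849:  return 198;
            case 791:  return 197;
            default:   return raw;
        }
    }

    // Normalized warning total: sum of each tool's count scaled by its maximum.
    static double normalizedTotal(int findBugs, int jlintRaw, int pmd) {
        return findBugs / MAX_FINDBUGS
                + capJlint(jlintRaw) / MAX_JLINT
                + pmd / MAX_PMD;
    }

    // Unique warning total: each tool contributes one per distinct warning kind.
    static int uniqueTotal(List<List<String>> warningKindsPerTool) {
        int total = 0;
        for (List<String> kinds : warningKindsPerTool) {
            total += new HashSet<>(kinds).size();
        }
        return total;
    }

    public static void main(String[] args) {
        String[] names = {
            "catalina.core.StandardContext", "megamek.server.Server",
            "azureus2.ui.swt.MainWindow", "catalina.core.StandardWrapper",
            "catalina.core.StandardHost", "catalina.core.ContainerBase",
            "artofillusion.object.TriangleMesh", "megamek.common.Compute"
        };
        int[][] counts = { // FindBugs, JLint (raw), PMD, as listed in Figure 10
            {34, 791, 37}, {6, 1979, 42}, {11, 90, 30}, {10, 50, 8},
            {4, 97, 3}, {14, 849, 3}, {5, 42, 140}, {0, 1076, 23}
        };
        for (int i = 0; i < names.length; i++) {
            double t = normalizedTotal(counts[i][0], counts[i][1], counts[i][2]);
            System.out.println(String.format(Locale.ROOT, "%-34s %.2f", names[i], t));
        }
    }
}
```

Under these assumptions the printed totals round to the Total_i column of Figure 10, for example 34/34 + 197/200 + 37/140 ≈ 2.25 for catalina.core.StandardContext. Note that `uniqueTotal` counts a warning kind once per tool, so a kind reported by two tools is counted twice, which is the cross-tool redundancy discussed in Section 4.2.2.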
For this test suite, Bandera finds all 8 deadlocks and produces no messages concerning the other 8 programs.

In comparison, FindBugs and PMD do not issue any warnings that would indicate a deadlock. PMD reports 19 warnings, but only about null assignments and missing braces around loop bodies, which in this case have no effect on synchronization. FindBugs issues 5 warnings, 4 of which were about package protections and the other of which warned about using notify() instead of notifyAll() (the use of notify() is correct).

On the other hand, ESC/Java reports 79 warnings, 30 of which are for potential deadlocks in 9 of the programs. One of the 9 programs did not have deadlock. JLint finds potential synchronization bugs in 5 of the 8 programs Bandera verified to have a deadlock error. JLint issues three different kinds of concurrency warnings for these programs: a warning for changing a lock variable that has been used in synchronization, a warning for requesting locks that would lead to a lock cycle, and a warning for improper use of monitor objects. In all, JLint reported 34 potential concurrency bugs over 5 programs.

Compared to JLint, Bandera has the advantage that it can produce counterexamples. Because it is based on model checking technology, when Bandera finds a potential deadlock it can produce a full program trace documenting the sequence of operations leading up to the error and a graphical representation of the lock graph with a deadlock. Non-model checking tools such as JLint often are not as well geared to generating counterexample traces.

6 Usability Issues and Improvements to the Meta-Tool

In the course of our experiments, we encountered a number of issues in applying the tools to our benchmark suite. Some of these issues must be dealt with within a tool, and some of the issues can be addressed by improving our proposed meta-tool.

In a number of cases, we had difficulty using certain versions of the tools because they were not compatible with the latest version of Java. As mentioned earlier, when we initially experimented with ESC/Java, we downloaded version 0.7 and discovered that it was not compatible with Java 1.4. Fortunately ESC/Java 2, which has new developers, is compatible, so we were able to use that version for our experiments. But we are still unable to use some important related tools for ESC/Java such as Houdini, which is not compatible with ESC/Java 2. We had similar problems with an older version of JLint, which also did not handle Java 1.4. The lesson for users is probably to rely only on tools under active development, and the lesson for tool builders is to keep up with the latest language features lest a tool become unusable. This may especially be an issue with the upcoming Java 1.5, which includes source-level extensions such as generics.

In our opinion, tools that provide graphical user interfaces (GUIs) or plugins for a variety of integrated development environments have a clear advantage over those tools that provide only textual output. A well-designed GUI can group classes of bugs together and hyperlink warnings to source code. Although we did not use them directly in our study, in our initial phase of learning to use the tools we found GUIs invaluable. Unfortunately, GUIs conflict somewhat with having a meta-tool, since they make it much more difficult for a meta-tool to extract the analysis results. Thus probably the best compromise is to provide both structured text output (for processing by the meta-tool) and a GUI. We leave as future work the development of a generic, easy-to-use GUI for the meta-tool itself.

Also, while developers want to find as many bugs as possible, it is important not to overwhelm the developer with too much output. In particular, one critical ability is to avoid cascading errors. For example, in some cases JLint repeatedly warns about dereferencing a variable that may be null, but it would be sufficient to warn only at the first dereference. These cases may be possible to eliminate with the meta-tool. Or, better yet, the tool could be modified so that once a warning about a null pointer is issued, the
pointer would subsequently be assumed not to be null (or whatever the most optimistic assumption is) to suppress further warnings. Similarly, sometimes JLint produces a large number of potential deadlock warnings, even reporting the same warning multiple times on the same line. In this case, the meta-tool could easily filter redundant error messages and reduce them to a single warning. In general, the meta-tool could allow the user to select between full output from each of the tools and output limited to unique warnings.

As mentioned throughout this paper, false positives are an issue with all of the tools. ESC/Java is the one tool that supports user-supplied annotations to eliminate spurious warnings. We could incorporate a poor-man's version of this annotation facility into the meta-tool by allowing the user to suppress certain warnings at particular locations in the source code. This would allow the user to prune the output of tools to reduce false positives. In general such a facility must be used extremely carefully, since it is likely that subsequent code modifications might render the suppression of warnings confusing or even incorrect.

A meta-tool could also interpret the output of the tools in complex ways. In particular, it could use a warning from one tool to decide whether another tool's warning has a greater probability of being valid. For example, in one case we encountered, a PMD-generated warning about a null assignment coincided with a JLint warning for the same potential bug. After a manual check of the bug, we found that both tools were correct in their assessment.

Finally, as discussed in Section 1.1, it is impossible to generally classify the severity of a warning without knowing the context in which the application is used. However, it might be possible for developers to classify bug severity for their own programs. Initially, warnings would be weighted evenly, and a developer could change the weights so that different bugs were weighted more or less in the meta-tool's rankings. For example, warnings that in the past have led to severe errors might be good candidates for increased weight. Weights could also be used to adjust for false positive rates. If a particular bug checker is known to report many false positives for a particular application, those warnings can be assigned a lower weight.

7 Related Work

Artho [1] compares several dynamic and static tools for finding errors in multi-threaded programs. Artho compares the tools on several small core programs extracted from a variety of Java applications. Artho then proposes extensions to JLint, included in the version we tested in this paper, to greatly improve its ability to check for multi-threaded programming bugs, and gives results for running JLint on several large applications. The focus of this paper, in contrast, is on looking at a wider variety of bugs across several benchmarks and proposing a meta-tool to examine the correlations.

Z-ranking [17] is a technique for ranking the output of static analysis tools so warnings that are more important will tend to be ranked more highly. As our results suggest, having such a facility in the tools we studied would be extremely useful. Z-ranking is intended to rank the output of a particular bug checker. In our paper, however, we look at correlating warnings across tools and across different checkers.

In general, since many of these Java bug finding tools have only been developed within the last few years, there has not been much work comparing them. One article on a developer web log by Jelliffe [15] briefly describes experience using JLint, FindBugs, PMD, and CheckStyle (a tool we did not study; it checks adherence to a coding style). In his opinion, JLint and FindBugs find different kinds of bugs, and both are very useful on existing code, while PMD and CheckStyle are more useful if you incorporate their rules into projects from the start.

8 Conclusion

We have examined the results of applying five bug-finding tools to a variety of Java programs. Although there is some overlap between the kinds of bugs found by the tools, mostly their warnings are distinct. Our experiments do, however, suggest that many tools reporting an unusual number of warnings for a class is correlated with a large breadth of unique warnings, and we propose a meta-tool to allow developers to identify these classes.

As we ran the tools and examined the output, there seemed to be a few things that would be beneficial in general. The main difficulty in using the tools is simply the quantity of output. In our opinion, the programmer should have the ability to add an annotation or a special comment into the code to suppress warnings that are false positives, even though this might lead to potential future problems (due to changes in assumptions). Such a mechanism seems necessary to help reduce the sheer output of the tools. In Section 6 we proposed adding this as a feature of the meta-tool.

In this paper we have focused on comparing the output of different tools. An interesting area of future work is to gather extensive information about the actual faults in programs, which would enable us to precisely identify false positives and false negatives. This information could be used to determine how accurately each tool predicts faults in our benchmarks. We could also test whether the two metrics we proposed for combining warnings from multiple tools are better or worse predictors of faults than the individual tools.

Finally, recall that all of the tools we used are in some
ways unsound. Thus the absence of warnings from a tool does not imply the absence of errors. This is certainly a necessary tradeoff, because as we just argued, the number of warnings produced by a tool can be daunting and stand in the way of its use. As we saw in Section 3, without user annotations a tool like ESC/Java that is still unsound yet much closer to verification produces even more warnings than JLint, PMD, and FindBugs. Ultimately, we believe there is still a wide area of open research in understanding the right tradeoffs to make in bug finding tools.

Acknowledgments

We would like to thank David Cok and Joe Kiniry for helping us get ESC/Java 2 running. We would also like to thank Cyrille Artho for providing us with a beta version of JLint 3.0. Finally, we would like to thank Atif Memon, Bill Pugh, Mike Hicks, and the anonymous referees for their helpful comments on earlier versions of this paper.

References

[1] C. Artho. Finding faults in multi-threaded programs. Master's thesis, Institute of Computer Systems, Federal Institute of Technology, Zurich/Austin, 2001.
[2] D. Bacon, J. Block, J. Bogoda, C. Click, P. Haahr, D. Lea, T. May, J.-W. Maessen, J. D. Mitchell, K. Nilsen, B. Pugh, and E. G. Sirer. The "Double-Checked Locking is Broken" Declaration. http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html.
[3] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong. Orthogonal defect classification - a concept for in-process measurements. IEEE Transactions on Software Engineering, 18(11):943–956, Nov. 1992.
[4] D. Cok. Personal communication, Apr. 2004.
[5] D. Cok and J. Kiniry. ESC/Java 2, Mar. 2004. http://www.cs.kun.nl/sos/research/escjava/index.html.
[6] J. C. Corbett, M. B. Dwyer, J. Hatcliff, S. Laubach, C. S. Pasareanu, Robby, and H. Zheng. Bandera: Extracting Finite-state Models from Java Source Code. In Proceedings of the 22nd International Conference on Software Engineering, pages 439–448, Limerick, Ireland, June 2000.
[7] R. F. Crew. ASTLOG: A Language for Examining Abstract Syntax Trees. In Proceedings of the Conference on Domain-Specific Languages, Santa Barbara, California, Oct. 1997.
[8] M. D. Ernst, A. Czeisler, W. G. Griswold, and D. Notkin. Quickly detecting relevant program invariants. In ICSE 2000, Proceedings of the 22nd International Conference on Software Engineering, pages 449–458, Limerick, Ireland, June 7–9, 2000.
[9] C. Flanagan and K. R. M. Leino. Houdini, an Annotation Assistant for ESC/Java. In J. N. Oliveira and P. Zave, editors, FME 2001: Formal Methods for Increasing Software Productivity, International Symposium of Formal Methods, number 2021 in Lecture Notes in Computer Science, pages 500–517, Berlin, Germany, Mar. 2001. Springer-Verlag.
[10] C. Flanagan, K. R. M. Leino, M. Lillibridge, G. Nelson, J. B. Saxe, and R. Stata. Extended Static Checking for Java. In Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 234–245, Berlin, Germany, June 2002.
[11] K. Havelund and T. Pressburger. Model checking JAVA programs using JAVA pathfinder. International Journal on Software Tools for Technology Transfer, 2(4):366–381, 2000.
[12] G. J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5):279–295, 1997.
[13] D. Hovemeyer and W. Pugh. Finding Bugs Is Easy. http://www.cs.umd.edu/~pugh/java/bugs/docs/findbugsPaper.pdf, 2003.
[14] IEEE. IEEE Standard Classification for Software Anomalies, Dec. 1993. IEEE Std 1044-1993.
[15] R. Jelliffe. Mini-review of Java Bug Finders. In O'Reilly Developer Weblogs. O'Reilly, Mar. 2004. http://www.oreillynet.com/pub/wlg/4481.
[16] JLint. http://artho.com/jlint.
[17] T. Kremenek and D. Engler. Z-Ranking: Using Statistical Analysis to Counter the Impact of Static Analysis Approximations. In R. Cousot, editor, Static Analysis, 10th International Symposium, volume 2694 of Lecture Notes in Computer Science, pages 295–315, San Diego, CA, USA, June 2003. Springer-Verlag.
[18] PMD/Java. http://pmd.sourceforge.net.