A string of characters used to name (or otherwise identify) a Java construct.
Identifiers are used in source code for the names of variables, methods, types, type parameters, and packages. Each identifier is parsed as a separate token during tokenization of a Java program. Keywords are not considered identifiers.
For example, in:
int x = 10;
x is an identifier.
Identifiers consist of:
1 | ident-start-char ident-body-chars
where...
ident-start-char | is either _, $, or any valid letter (see below).
ident-body-chars | is a sequence of characters composed entirely of _, $, or any valid letter or digit (see below). The sequence may be of any length, including zero.
such that...
the resulting character sequence is not a keyword, the null literal, or either Boolean literal.
1 An identifier, used to name a variable, method, class or other type, type parameter, etc.
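The rule above (a valid character sequence that is not a keyword or literal) can be sketched with the JDK's javax.lang.model.SourceVersion class. This is only an illustration: the class name IdentifierValidator and the method isValidIdentifier are made up for this example, and the result reflects the latest source version known to the running JDK.

```java
import javax.lang.model.SourceVersion;

final class IdentifierValidator {
    /**
     * Hypothetical helper: true if s has the shape of an identifier
     * and is not a keyword, the null literal, or a Boolean literal.
     */
    static boolean isValidIdentifier(String s) {
        // isIdentifier accepts keywords too, so they are excluded separately.
        return SourceVersion.isIdentifier(s) && !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"x", "dbl1", "1x", "int", "null", "true"}) {
            System.out.println(s + " -> " + isValidIdentifier(s));
        }
    }
}
```

Here x and dbl1 are accepted, while 1x (starts with a digit), int (keyword), null, and true (literals) are rejected.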
Java identifiers are composed of Unicode code-points (characters), defined by the Unicode standard. The code-points that can comprise an identifier are defined by the static Java API methods Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int), for the ident-start-char and each character in ident-body-chars, respectively; if the corresponding method returns true for a code-point, then the respective syntax element can be composed of that code-point in a valid identifier.
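As a rough sketch of how these two methods divide the work, the following walks a string code-point by code-point. The class and method names (IdentifierCheck, looksLikeIdentifier) are invented for illustration, and the check deliberately does not exclude keywords or literals.

```java
final class IdentifierCheck {
    /**
     * Hypothetical helper: true if the first code-point passes
     * isJavaIdentifierStart and every later one passes isJavaIdentifierPart.
     * Keywords and literals are NOT rejected here.
     */
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Advance by charCount so code-points needing two chars are read whole.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIdentifier("dbl1"));      // true
        System.out.println(looksLikeIdentifier("1dbl"));      // false
        System.out.println(looksLikeIdentifier("$_\u03B1"));  // true ($, _, Greek alpha)
        System.out.println(looksLikeIdentifier("a-b"));       // false
    }
}
```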
In Java, a 16-bit character is represented by the char type, and Strings are sequences of chars. Some Unicode code-points need more than 16 bits, so two chars are required to represent each of them (totaling 32 bits). (This is why some methods in the Character and String classes are overloaded to accept an int, which is 32 bits.) For this reason, the Java documentation distinguishes between characters and code-points, the latter being represented by an int in Java and the former by a char. Code-points are what most people understand a character to be: a concept of a letter, digit, or other symbol (primarily) used for communication.
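The difference between chars and code-points can be seen with a code-point outside the 16-bit range. The sketch below uses U+20000 (a CJK ideograph) as an arbitrary example; the class CodePointDemo and its helper are illustrative names only.

```java
final class CodePointDemo {
    /** Hypothetical helper: builds a String holding exactly one code-point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x20000);

        // One code-point, but two chars (a surrogate pair).
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1

        // The char overload sees only half of the pair; the int overload
        // sees the whole code-point, which is a letter.
        System.out.println(Character.isJavaIdentifierStart(s.charAt(0)));      // false
        System.out.println(Character.isJavaIdentifierStart(s.codePointAt(0))); // true
    }
}
```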
When referring to a Unicode code-point, this document uses the term code-point.
Unicode groups characters into blocks and into general categories (and sub-categories). Blocks are generally used to group code-points by language, function, etc. A character can only be in one block, and blocks are contiguous. General categories are an attribute of code-points. Each code-point is assigned exactly one general category and sub-category (known as the major and minor categories, respectively).
The first character of a Java identifier may be any Unicode code-point whose general category is:
L (i.e., the code-point's major category is Letter),
Nl (i.e., the code-point is a: Number, letter),
Sc (i.e., the code-point is a: Symbol, currency), or
Pc (i.e., the code-point is a: Punctuation, connector).
Note that this encompasses the overwhelming majority of assigned Unicode code-points; more than 130 thousand characters can be used to begin a Java identifier.
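The categories listed above can be inspected with Character.getType(int). The following sketch maps a few sample code-points to the abbreviations used in this document and counts how many code-points the running JDK accepts as an identifier start. The names StartCategories, startCategory, and countIdentifierStarts are invented for this example, and the count depends on the Unicode version the JDK supports, so no specific number is shown.

```java
final class StartCategories {
    /** Hypothetical helper: returns "L", "Nl", "Sc", "Pc", or "other". */
    static String startCategory(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return "L";
            case Character.LETTER_NUMBER:
                return "Nl";
            case Character.CURRENCY_SYMBOL:
                return "Sc";
            case Character.CONNECTOR_PUNCTUATION:
                return "Pc";
            default:
                return "other";
        }
    }

    /** Counts every code-point that may begin an identifier on this JDK. */
    static int countIdentifierStarts() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isJavaIdentifierStart(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(startCategory('A'));     // L
        System.out.println(startCategory(0x2167));  // Nl (Roman numeral eight)
        System.out.println(startCategory('$'));     // Sc
        System.out.println(startCategory('_'));     // Pc
        System.out.println(startCategory('1'));     // other (Nd: not a valid start)
        System.out.println(countIdentifierStarts());
    }
}
```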
Subsequent characters in a Java identifier may be any code-point that is ignored (see below), any code-point that can begin a Java identifier, or any code-point whose general category is:
Nd (i.e., the code-point is a: Number, decimal digit),
Mc (i.e., the code-point is a: Mark, spacing combining),
Mn (i.e., the code-point is a: Mark, nonspacing), or
Cf (i.e., the code-point is an: Other, format; note that these characters are ignored characters).
Ignored characters are code-points for which Character.isIdentifierIgnorable(int) returns true. This includes those whose general category is Cf (as listed above), and all code-points from:
\u0000 (the null character) to \u0008 (the backspace character),
\u000E (the shift out character) to \u001B (the escape character), and
\u007F (the delete character) to \u009F (the application program command character).
These characters are not actually ignored, but are rather allowed as subsequent characters in an identifier.
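The boundaries of these ranges can be probed directly. The sketch below is only a spot check of a few code-points; IgnorableDemo and describe are illustrative names, and U+00AD (soft hyphen) is used as a sample Cf character.

```java
final class IgnorableDemo {
    /** Hypothetical helper: formats the three relevant checks for one code-point. */
    static String describe(int cp) {
        return String.format("U+%04X ignorable=%b part=%b start=%b",
                cp,
                Character.isIdentifierIgnorable(cp),
                Character.isJavaIdentifierPart(cp),
                Character.isJavaIdentifierStart(cp));
    }

    public static void main(String[] args) {
        // Inside the listed ranges: ignorable, allowed as a part, not as a start.
        System.out.println(describe(0x0000));
        System.out.println(describe(0x001B));
        System.out.println(describe(0x009F));
        System.out.println(describe(0x00AD)); // soft hyphen, category Cf

        // Just outside the ranges: tab and no-break space are not ignorable.
        System.out.println(describe(0x0009));
        System.out.println(describe(0x00A0));
    }
}
```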
int x, y, z;
double dbl1, dbl2, dbl3, dbl4, dbl5;
Object text = "Some text";
int α = 10;
double β = 20;
System.out.println(β - α);
Combining marks allowed in identifiers can graphically extend upwards when a program renders a variable name (or other identifier) in Java source code:
int xͫ = 14;
xͫ++;
System.out.println(xͫ);
Output:
15
Combining marks are often small, which can make it hard to differentiate identifiers:
static void someͫMethod() {
    System.out.println("Failure");
}
static void someMethod() {
    System.out.println("Success");
}
public static void main(String[] args) {
    someͫMethod(); // Prints "Failure"
}
Output:
Failure
Unicode characters can be used to produce graphically disruptive identifiers in source code. This is often achieved through the usage of combining diacritical marks:
int x̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅ = 10;
This can produce visual clutter in IDEs and other text editors. (Image: how Eclipse IDE renders the above code.)
Changing the zoom level in Eclipse affects character rendering, and the stacked combining diacritical marks can reach far enough to cover text two lines above.
When used profusely, identifiers with combining characters may distend and cover other parts of code, as the following declaration may do in various IDEs:
int v̶̫͗̾̀a̷̻̟̿̂́̿ṛ̴̡̢̳͒i̵̮̾̇͊͠ả̷͍͂̈́͝b̵͍̠̬̼̊͑l̷̰̩̍͗̈́e̴͕̩̗͑̔͋ͅ = 12;
Unicode characters can also be used to produce graphically equivalent variables which are actually different. This can be done using various characters that have no visual appearance, such as the zero-width space (\u200B):
boolean javaref = true;
boolean javaref = false;
if (javaref)
System.out.println(javaref);
else
System.out.println(javaref);
Output:
false
The above code snippet contains a zero-width space in the second declaration of javaref and in the occurrences of javaref in the if's condition and the second print statement. The if's condition therefore evaluates to false and the else block is executed, printing false.
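One way to catch identifiers like those in the examples above is to scan a name for code-points that are easy to miss on screen. The following is a minimal sketch, not an established tool: HiddenCharScan and suspiciousCodePoints are invented names, and treating every combining mark as suspicious is an assumption that would also flag legitimate identifiers in scripts that rely on such marks.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenCharScan {
    /**
     * Hypothetical helper: lists the code-points in name that are either
     * ignorable (e.g. the zero-width space) or combining marks (Mn, Mc).
     */
    static List<String> suspiciousCodePoints(String name) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            int type = Character.getType(cp);
            if (Character.isIdentifierIgnorable(cp)
                    || type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK) {
                found.add(String.format("U+%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(suspiciousCodePoints("javaref"));           // []
        System.out.println(suspiciousCodePoints("java\u200Bref"));     // [U+200B]
        System.out.println(suspiciousCodePoints("some\u036BMethod"));  // [U+036B]
    }
}
```

The escapes \u200B and \u036B are written out in the string literals so that the hidden characters are visible in this listing.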
Character class. See the isJavaIdentifierStart(int) and isJavaIdentifierPart(int) method documentation for more details.
int variable above.