Identifiers

An identifier is a string of characters used to name (or otherwise identify) a Java construct.

Identifiers are used in source code for the names of variables, methods, types, type parameters, and packages. Each identifier is parsed as a separate token during tokenization of a Java program. Keywords, the boolean literals true and false, and the null literal are not considered identifiers.

For example,
int x = 10;

x is an identifier.
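
The distinction between identifiers and keywords can be checked programmatically. The following is a minimal sketch that uses the standard javax.lang.model.SourceVersion class; the class name KeywordCheck and the helper isUsableName are hypothetical names chosen for this illustration. SourceVersion.isIdentifier only checks the characters, so the keyword test is applied separately.

```java
import javax.lang.model.SourceVersion;

public class KeywordCheck {
    /**
     * Hypothetical helper: true if s has identifier syntax and is not a
     * keyword, boolean literal, or null literal in the latest source version.
     */
    public static boolean isUsableName(String s) {
        return SourceVersion.isIdentifier(s) && !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"x", "int", "true", "null", "1x"}) {
            System.out.println(s + " -> " + isUsableName(s));
        }
    }
}
```

Of the five sample strings, only x is reported as usable: int, true, and null are reserved, and 1x begins with a digit.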

Syntax

Identifiers consist of:

1 ident-start-char ident-body-chars

where...

ident-start-char is _, $, or any valid letter (see below).
ident-body-chars is a sequence of characters composed entirely of _, $, and valid letters or digits (see below). The sequence may be of any length, including zero.

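such that the resulting character sequence is not a keyword, a boolean literal (true or false), or the null literal.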

Syntax Elements

1 An identifier, used to name a variable, method, class or other type, type parameter, etc.

Composition

Java identifiers are composed of Unicode code-points (characters), as defined by the Unicode Standard. The code-points that may appear in an identifier are determined by the static Java API methods Character.isJavaIdentifierStart(int), for ident-start-char, and Character.isJavaIdentifierPart(int), for each character in ident-body-chars. If the relevant method returns true for a code-point, that syntax element may contain the code-point in a valid identifier.
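
The rule above can be sketched as a small validator. This is an illustrative sketch only: the class name IdentifierSyntax and the method hasIdentifierSyntax are hypothetical, and the check covers characters only, so it does not reject keywords or literals.

```java
public class IdentifierSyntax {
    /**
     * Hypothetical helper: checks ident-start-char and ident-body-chars
     * one code-point at a time. Keywords and literals are NOT rejected.
     */
    public static boolean hasIdentifierSyntax(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Advance by one or two chars, depending on the code-point's size.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasIdentifierSyntax("_tmp$1")); // true
        System.out.println(hasIdentifierSyntax("1x"));     // false
        System.out.println(hasIdentifierSyntax("a-b"));    // false
    }
}
```

Iterating by code-point rather than by char matters for identifiers that contain characters outside the 16-bit range.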

The Unicode Standard technically specifies code-points, not characters. In Java (and the Java specification), a character is a 16-bit value stored in the char type, and a String is a sequence of chars. Code-points above U+FFFF do not fit in 16 bits, so Java represents each of them with two chars (a surrogate pair, 32 bits in total). This is why some methods in the Character and String classes are overloaded to accept an int, which is 32 bits wide. For this reason, the Java documentation distinguishes between characters, represented by a char, and code-points, represented by an int. Code-points are what most people understand a character to be: the concept of a letter, digit, or other symbol used (primarily) for communication.

When referring to a Unicode code-point, this document uses the term code-point.
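
The difference between chars and code-points can be seen with a code-point above U+FFFF. The sketch below uses U+1D465 (MATHEMATICAL ITALIC SMALL X) as an arbitrary example; the class name CodePointDemo and the helper fromCodePoint are hypothetical names for this illustration.

```java
public class CodePointDemo {
    /** Hypothetical helper: builds a String holding a single code-point. */
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String name = fromCodePoint(0x1D465);

        // One code-point, but two chars (a surrogate pair).
        System.out.println(name.length());                         // 2
        System.out.println(name.codePointCount(0, name.length())); // 1

        // The int overload sees the whole code-point, which is a letter.
        System.out.println(Character.isJavaIdentifierStart(name.codePointAt(0))); // true
        // The char overload only sees the high surrogate, which is not.
        System.out.println(Character.isJavaIdentifierStart(name.charAt(0)));      // false
    }
}
```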

Permitted Characters for ident-start-char

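The first character of a Java identifier may be any Unicode code-point whose general category is a letter (Lu, Ll, Lt, Lm, or Lo), a letter number (Nl), a currency symbol (Sc), or connector punctuation (Pc).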

Unicode groups characters into blocks and into general categories (and sub-categories). Blocks are generally used to group code-points by language, function, etc. A character can only be in one block and blocks are contiguous. General Categories are an attribute of code-points. Each code-point is assigned exactly one general category and sub-category (known as the Major and minor categories, respectively).

Note that this encompasses the large majority of assigned Unicode characters; more than 130 thousand code-points can be used to begin a Java identifier.
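
The count can be reproduced by testing every code-point in turn. The following is a rough sketch; the class name StartCharCount and the method countIdentifierStarts are hypothetical, and the exact figure depends on the Unicode version bundled with the Java runtime, so no specific output is shown.

```java
public class StartCharCount {
    /** Hypothetical helper: counts code-points accepted as ident-start-char. */
    public static int countIdentifierStarts() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isJavaIdentifierStart(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The result varies between Java releases as Unicode grows.
        System.out.println(countIdentifierStarts());
    }
}
```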

Permitted Characters for ident-body-chars

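Subsequent characters in a Java identifier may be any code-point that can begin a Java identifier, any code-point that is ignored (see below), or any code-point whose general category is a decimal digit number (Nd), a non-spacing mark (Mn), a spacing combining mark (Mc), or a format character (Cf).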

Ignored characters are code-points for which Character.isIdentifierIgnorable(int) returns true. This includes every code-point whose general category is Cf, and the control characters U+0000 through U+0008, U+000E through U+001B, and U+007F through U+009F.

These characters are not actually ignored, but are rather allowed as subsequent characters in an identifier.
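
The three groups described in this section can be told apart with the Character methods named above. This is a minimal sketch: the class name BodyCharDemo, the method classify, and its label strings are hypothetical and exist only for this illustration.

```java
public class BodyCharDemo {
    /** Hypothetical helper: labels a code-point by where it may appear. */
    public static String classify(int cp) {
        if (Character.isJavaIdentifierStart(cp)) {
            return "start or body";
        }
        if (Character.isIdentifierIgnorable(cp)) {
            return "body only (ignorable)";
        }
        if (Character.isJavaIdentifierPart(cp)) {
            return "body only";
        }
        return "not allowed";
    }

    public static void main(String[] args) {
        System.out.println(classify('a'));    // start or body
        System.out.println(classify('7'));    // body only
        System.out.println(classify(0x036B)); // body only (a combining mark)
        System.out.println(classify(0x200B)); // body only (ignorable)
        System.out.println(classify('-'));    // not allowed
    }
}
```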

Examples

Simple Variable Declarations

int x, y, z;
double dbl1, dbl2, dbl3, dbl4, dbl5;
Object text = "Some text";

Identifiers with Greek Letters

int α = 10;
double β = 20;
System.out.println(β - α);

Combining Mark in Identifier

Combining marks allowed in identifiers can graphically extend upwards when a program renders a variable name (or other identifier) in Java source code:

int xͫ = 14;
xͫ++;
System.out.println(xͫ);

Output:

15

Combining marks are often small, which can make it hard to differentiate identifiers:

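// This method's name contains a combining mark after "some".
static void someͫMethod() {
	System.out.println("Failure");
}

// This method's name is plain ASCII.
static void someMethod() {
	System.out.println("Success");
}

public static void main(String[] args) {
	someͫMethod(); // Calls the first method; prints "Failure"
}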

Output:

Failure

Notes

  1. Unicode characters can be used to produce graphically disruptive identifiers in source code. This is often achieved through the usage of combining diacritical marks:

    int x̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅ = 10;

    This can produce visual clutter in IDEs and other text editors. Eclipse IDE, for example, renders the marks in the above code stacked above the identifier.

    Changing the zoom level in Eclipse affects character rendering and the height that the stacked combining diacritical marks reach, which can be enough to cover text two lines above.

    When used profusely, identifiers with combining characters may distend and cover other parts of code, as the following declaration may do in various IDEs:

    int v̶̫͗̾̀a̷̻̟̿̂́̿ṛ̴̡̢̳͒i̵̮̾̇͊͠ả̷͍͂̈́͝b̵͍̠̬̼̊͑l̷̰̩̍͗̈́e̴͕̩̗͑̔͋ͅ = 12;

  2. Unicode characters can also be used to produce visually identical identifiers that are actually distinct. This can be done with characters that have no visible appearance, such as the zero-width space (\u200B):

    boolean javaref = true;
    boolean javaref​ = false;
    if (javaref​)
    	System.out.println(javaref);
    else
    	System.out.println(javaref​);

    Output:

    false

    The above code snippet contains a zero-width space in three of the occurrences of javaref: the second declaration, the if's condition, and the second print statement. The if's condition therefore evaluates to false and the else block is executed, printing false.
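
Hidden or easily overlooked characters such as those in the notes above can be exposed by escaping everything outside printable ASCII. The following is a small sketch, not a feature of any particular IDE; the class name IdentifierEscaper and the method escapeNonAscii are hypothetical.

```java
public class IdentifierEscaper {
    /**
     * Hypothetical helper: replaces every char outside printable ASCII
     * with a Java-style \\uXXXX escape so that it becomes visible.
     */
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("javaref"));       // javaref
        System.out.println(escapeNonAscii("javaref\u200B")); // javaref\u200B
    }
}
```

Running identifiers from a source file through such a helper makes two names that look the same on screen compare as visibly different strings.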

External Links