The native JavaScript String.length is a count of the 16-bit code units in a string. This can cause problems when counting characters and splitting strings in internationalized web applications, because a single displayed character can be made up of more than one code unit.

Problem

String.length is a count of the 16-bit code units in a string. That is useful for internationalized applications that rely on Unicode when they need to determine the maximum amount of space a string could occupy when stored in the back end, e.g., in a database column. However, it does not work well for limiting text by character count, because a single character as the reader sees it can be composed of two or more code points.

For example, in Devanagari (Hindi) थि is actually two code points, each stored as one 16-bit code unit. थ is a consonant and ि is a vowel sign, which cannot stand by itself because it is a combining mark. As a result, 'थि'.length returns 2 in JavaScript, which is accurate because String.length is designed to count code units, not characters. However, if you are counting characters for display rather than for size, this result is wrong. String.length is not faulty; it is just being used incorrectly. Another problem can occur when splitting strings.
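The count described above can be checked directly. The snippet below is a minimal sketch; the variable names are illustrative only, and the string is written with escape sequences so that it does not depend on the file's encoding.

```javascript
// थि written with escapes: U+0925 (the consonant थ) followed by U+093F (the vowel sign ि).
const syllable = '\u0925\u093F';

// String.length counts UTF-16 code units.
const unitCount = syllable.length;

// Array.from iterates a string by code point, so this counts code points.
const codePointCount = Array.from(syllable).length;

// The hexadecimal value of each code point in the string.
const codePoints = Array.from(syllable, (ch) => ch.codePointAt(0).toString(16));

console.log(unitCount, codePointCount, codePoints);
```

Both counts are 2 here, even though a reader sees a single character.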

When a string contains letters and combining marks, splitting it can produce grammatically incorrect values or letters with unintended meaning. Returning to the Devanagari example, if थि is split with 'थि'.slice(1, 2), the value returned is a combining mark that is not intended to be used without a consonant.
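The split described above can be reproduced as follows. This is only an illustration: it assumes an engine that supports ES2018 Unicode property escapes (\p{M} with the u flag), and the variable names are made up for the example.

```javascript
// थि written with escapes: U+0925 followed by U+093F.
const text = '\u0925\u093F';

// Slicing by code unit index separates the vowel sign from its consonant.
const orphan = text.slice(1, 2);

// \p{M} matches any code point in the Unicode "Mark" category.
const orphanIsMark = /\p{M}/u.test(orphan);

// The remaining piece is the bare consonant, which now reads differently.
const base = text.slice(0, 1);

console.log(orphanIsMark, base);
```

The slice succeeds without any error, which is why this kind of problem is easy to miss.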

Solution

When splitting international strings and counting characters for display purposes, care should be taken to differentiate between letters and marks. For example, the code below uses the equivalent of \p{M} (see letters and marks; range truncated for display purposes) to match any mark, such as the combining mark in the Devanagari example. It then decrements the String.length count for each mark and returns the character count. Unicode property escapes such as \p{M} are not available in older JavaScript engines (ES2018 added them, together with the u flag), but Steven Levithan has written an excellent Unicode plugin for his XRegExp library. Unicode block ranges can also be created using http://kourge.net/projects/regexp-unicode-block. Please note that I am not a language expert. The code points and ranges below should be adjusted to meet your specific needs.

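(function () {
    // Combining-mark ranges, truncated here for display: replace the ' ... '
    // with the remaining ranges before use, or spaces and periods will match.
    var M = '\u0300-\u036F\u0483-\u0489\u0591-\u05BD ... \uFE20-\uFE26',
        // No 'g' flag: a global regex keeps lastIndex between test() calls,
        // which would cause consecutive marks to be miscounted.
        regex = new RegExp('[' + M + ']');

    String.prototype.charCount = function () {
        var len = this.length, count = len, i;

        for (i = 0; i < len; i++) {
            // Each mark combines with the preceding letter, so do not count it.
            if (regex.test(this.charAt(i))) {
                count--;
            }
        }

        return count;
    };
}());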
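On engines that support ES2018 Unicode property escapes, the same idea can be sketched without a hand-maintained range list. The helpers below, countDisplayCharacters and sliceKeepingMarks, are hypothetical names invented for this illustration and are not part of any library. They only treat marks specially, so they are an approximation rather than full grapheme segmentation.

```javascript
// Matches one code point whose Unicode general category is Mark (Mn, Mc, or Me).
// Created without the g flag so that test() keeps no state between calls.
const MARK = /\p{M}/u;

// Hypothetical helper: counts the code points that are not combining marks.
function countDisplayCharacters(text) {
    let count = 0;
    for (const ch of text) { // for...of iterates a string by code point
        if (!MARK.test(ch)) {
            count++;
        }
    }
    return count;
}

// Hypothetical helper: behaves like slice() for non-negative indices, but moves
// each boundary forward past any marks so that a mark is never separated from
// the letter before it. Assumption: indices are code unit offsets, so marks
// outside the Basic Multilingual Plane are not detected by this sketch.
function sliceKeepingMarks(text, start, end = text.length) {
    let s = Math.min(start, text.length);
    let e = Math.min(end, text.length);
    while (s > 0 && s < text.length && MARK.test(text[s])) {
        s++;
    }
    while (e > 0 && e < text.length && MARK.test(text[e])) {
        e++;
    }
    return text.slice(s, e);
}

// Usage with थि (U+0925 U+093F) followed by क (U+0915).
const sample = '\u0925\u093F\u0915';
console.log(countDisplayCharacters(sample));
console.log(sliceKeepingMarks(sample, 0, 1) === '\u0925\u093F');
```

Moving a boundary forward rather than backward is a design choice: a slice that would have started on an orphaned mark skips it, and a slice that would have ended just before a mark takes the mark along with its letter.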

Conclusion

String.length works as designed, but one must be careful not to misconstrue its intended usage when developing internationalized web applications using JavaScript. String.length should be used when measuring a string’s size in terms of 16-bit code units, but an alternate approach should be used when counting characters for display. Lastly, one must exercise caution when splitting strings, or else the results could be nonsensical.