Which languages are supported in pdfHTML?
This is not a new question. The answer can be found in chapter 6, but this question is asked so frequently that an extra entry in the FAQ section is justified. It's also an occasion to provide some extra information, e.g. about Google's Noto fonts project.
The easy answer to this question is: every language is supported as long as you provide the appropriate font. For instance, if you want to convert an HTML page with a combination of Chinese and English text, you will only see the English text unless you provide a font that contains glyphs for the Chinese characters. This is explained in chapter 6.
A more nuanced answer is: most languages are supported as long as you provide the appropriate font, and as long as you use pdfCalligraph.
If you want to convert an HTML page that contains text in an Indic language, such as Hindi, Kannada, Tamil, or Telugu, iText will have to make ligatures, and the pdfCalligraph add-on is required to achieve this;
If you want to convert an HTML page that contains text in Hebrew, the writing direction has to be changed from left-to-right to right-to-left; this also requires the pdfCalligraph add-on;
If you want to convert an HTML page that contains text in Arabic, the writing direction has to be changed to right-to-left and ligatures have to be made; both require the pdfCalligraph add-on.
These are only some examples of languages, or rather writing systems, that are supported by the pdfCalligraph add-on. New writing systems are added on a regular basis.
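To get a rough idea of whether a piece of text falls into one of the cases listed above, you can inspect the Unicode script of each character. The sketch below uses only the JDK; the class name `ScriptCheck` and the `COMPLEX_SCRIPTS` set are assumptions made for this illustration, and the set is limited to the scripts named in this FAQ entry. It is not the list of writing systems that pdfCalligraph supports, and it is not how iText itself decides anything.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

class ScriptCheck {
    // Hypothetical, deliberately incomplete: only the scripts mentioned in the text.
    static final Set<UnicodeScript> COMPLEX_SCRIPTS = EnumSet.of(
            UnicodeScript.DEVANAGARI, // used for Hindi
            UnicodeScript.KANNADA,
            UnicodeScript.TAMIL,
            UnicodeScript.TELUGU,
            UnicodeScript.HEBREW,
            UnicodeScript.ARABIC);

    /** Returns the scripts from COMPLEX_SCRIPTS that occur in the given text. */
    static Set<UnicodeScript> complexScriptsIn(String text) {
        Set<UnicodeScript> found = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)       // script of each code point
            .filter(COMPLEX_SCRIPTS::contains) // keep only the scripts of interest
            .forEach(found::add);
        return found;
    }

    public static void main(String[] args) {
        // Latin text followed by a Hebrew, an Arabic, and a Devanagari word,
        // written with Unicode escapes to avoid source-encoding issues.
        String sample = "Peace: \u05E9\u05DC\u05D5\u05DD / "
                + "\u0633\u0644\u0627\u0645 / \u0936\u093E\u0902\u0924\u093F";
        System.out.println(complexScriptsIn(sample));
    }
}
```

If the returned set is not empty, the text contains at least one of the scripts for which the list above says the pdfCalligraph add-on is needed.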
Important: pdfCalligraph is not available under an open source license.
We noticed that too many companies were violating the AGPL license, in the sense that they were distributing iText in a closed source application without purchasing a commercial license. We have successfully taken legal action against some of these companies, but it's much easier to prevent abuse than it is to fight it. Hence the decision to distribute specific components of iText, such as pdfCalligraph, as proprietary software under a closed source license. This way, companies have to establish a business relationship with iText Software if they want to use these add-ons. This gives us the opportunity to explain that free / open source software isn't freeware. The iText core library is released as AGPL software. The AGPL is a free software license, but that doesn't mean you can use iText for free in any circumstance. You can only use iText for free if you obey the rules of the AGPL, using the interpretation of the AGPL that is shipped with the source code of iText.
The screen shot below shows part of the peace.html HTML file that presents the word peace in a couple of hundred languages:
Peace in different languages (HTML)
The list isn't complete. For instance, I don't know the translation of the word peace in Abkhaz or Afro-Seminole.
The C07E13_Peace example converts this HTML file to a PDF document as shown in the screen shot below:
Peace in different languages (PDF)
If you look closely, you'll notice that iText even supports languages such as Amharic, Anglo-Saxon (Old English, now extinct), Cherokee, and many others. How is this possible?
A glance at the Fonts tab in the Document Properties explains what happened:
Peace in different languages (PDF fonts)
We added the fonts/noto/ directory to the FontProvider. In this folder, you will find 95 .ttf files taken from the Google Noto fonts project:
Google has been developing a font family called Noto, which aims to support all languages with a harmonious look and feel. Noto is Google's answer to tofu. The name noto is to convey the idea that Google's goal is to see "no more tofu". Noto has multiple styles and weights, and is freely available to all. The comprehensive set of fonts and tools used in Google's development is available in Google's GitHub repositories.
Note that tofu in the context of Noto doesn't refer to food, but to the blocks resembling tofu that are shown when a glyph can't be rendered:
Glyphs that can't be shown are replaced with "tofu"
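The relation between tofu and the number of fonts you register can be illustrated with a toy model. The sketch below is not iText's font selection logic: the class `TofuDemo`, the `render` method, and the idea of describing a font as a predicate over code points are all assumptions made for this illustration. A character is kept if at least one registered "font" covers it; otherwise it is replaced by U+25A1 (WHITE SQUARE) as a stand-in for tofu.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

class TofuDemo {
    // WHITE SQUARE, used here as a stand-in for the "tofu" block.
    static final int TOFU = 0x25A1;

    /**
     * Toy model: each "font" is a predicate telling whether it has a glyph
     * for a code point. Code points covered by no font become TOFU.
     */
    static String render(String text, List<IntPredicate> fontCoverage) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            boolean covered = fontCoverage.stream().anyMatch(f -> f.test(cp));
            out.appendCodePoint(covered ? cp : TOFU);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical coverage: one font for basic Latin ranges,
        // one for the Cherokee block (U+13A0 to U+13FF).
        IntPredicate latin = cp -> cp < 0x0250;
        IntPredicate cherokee = cp -> cp >= 0x13A0 && cp <= 0x13FF;

        // "peace" followed by three arbitrary Cherokee letters (not a real word).
        String text = "peace \u13A0\u13A1\u13A2";

        System.out.println(render(text, Arrays.asList(latin)));           // Cherokee letters become tofu
        System.out.println(render(text, Arrays.asList(latin, cherokee))); // all characters are kept
    }
}
```

In this model, every additional font can only reduce the amount of tofu, which is the reasoning behind adding the whole fonts/noto/ directory rather than a single font.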