Use Google Fonts for Machine Learning (Part 2: Filter & Generate)

Quickly generate hundreds of font images.

November 01, 2020

Machine learning training time-lapse. Image by Author.

Let’s pick up from where we left off in the last post. We cleaned up the JSON annotations and exported it as a CSV file. Now, let’s load the CSV file back in, and filter them with conditions we define and generate PNG images so that we can use the images to train Convolutional Neural Networks.

Directory Setup

Again, I am treating this post as a Jupyter Notebook, so if you want to follow along, open Jupyter Notebook in your own environment. You will also need to download Google Fonts Github repo and generate the CSV annotations. We went over all of these in the last post, so please check it out first.

As you can see from the tree diagram below, the Google Fonts Github repo is downloaded into input directory. All the font files are stored in ofl subdirectory, which is how it is originally set up and I have not made any changes to it. I have also set up notebook directory with .ipynb notebook (this post) as well as the CSV annotation file.

.
├── input
│   ├── fonts
│   │   ...
│   │   ├── ofl
│   │   ...
│   ├── fonts-master.json
└── notebook
    └── google-fonts-ml
        ├── google-fonts-ml.ipynb
        └── google-fonts-annotaion.csv

Load CSV Annotations

First, we will load the CSV annotation as pandas.DataFrame.

	# modules we need this time
	import matplotlib.pyplot as plt
	from PIL import Image,ImageDraw,ImageFont
	import pandas as pd
	import numpy as np

	# load the CSV we created in the last post
	df = pd.read_csv('./google-fonts-annotaion.csv')

view raw gfonts-2-loadcsv.py hosted with ❤ by GitHub

Look at Data Summary

Let’s have a look at what we are dealing with here — how many fonts are within each category, how many fonts have which variants, etc. We first filter the list of fonts using a regular expression to select the columns we want to look at. Find all the rows with the value of 1, sum all these rows, and sort by frequency.

	print('Number of fonts in each variant:')
	print(df[df.filter(regex=r'^variants_', axis=1).columns].eq(1).sum().sort_values(ascending=False))

	print('\nNumber of fonts in each subset:')
	print(df[df.filter(regex=r'^subsets_', axis=1).columns].eq(1).sum().sort_values(ascending=False))

	print('\nNumber of fonts in each category:')
	print(df[df.filter(regex=r'^category_', axis=1).columns].eq(1).sum().sort_values(ascending=False))

view raw gfonts-2-summary.py hosted with ❤ by GitHub

At the time of writing, for instance, there are 325 display, 311 sans-serif, and 204 serif font families. The number of fonts is not the same as the number of font files, however, because the CSV annotation is only counting the font families and there are other factors such as variable fonts, etc.

Filtering Fonts

Now, let’s create conditions to create sub-groups of fonts. Because we took the time to create a nice CSV annotation it is easy to filter fonts according to our own criteria. Say, how many fonts that are thin, regular and thai?

	mask = (df.filter(regex='thin', axis=1).sum(axis=1).astype(bool) &
	df.filter(regex='regular', axis=1).sum(axis=1).astype(bool) &
	df.filter(regex='thai', axis=1).sum(axis=1).astype(bool))
	df_selected = df[mask]
	df_selected

view raw gfonts-2-filter1.py hosted with ❤ by GitHub

The reason we sum(axis=1) after filtering is to capture multiple columns if they meet the same condition. For example, if we filter with italic, there are multiple columns that will be included such as bolditalic, mediumitalic, etc. We can combine multiple conditions to a boolean Series object called mask and pass it to our DataFrame object. It will only return the rows that meet the conditions. We use & to only select the fonts that meet all the conditions.

Here is another one — Let’s find the fonts that support chinese and sans-serif.

	mask = (df.filter(regex='chinese', axis=1).sum(axis=1).astype(bool) &
	df.filter(regex='sans-serif', axis=1).sum(axis=1).astype(bool))
	df_selected = df[mask]
	df_selected

view raw gfonts-2-filter2.py hosted with ❤ by GitHub

I get 3 fonts back that meet the criteria — NotoSansHK, NotoSansSC, and NotoSansTC. For some reason, these NotoSans fonts were not included in the Google Fonts Github repo. So, you will have to manually download them if you need to use them.

We don’t want to keep editing the mask object each time we use a different set of filters, so we will make a list with all the filters we want to use and simply iterate over them. These strings are in fact regular expressions and you can try more advanced filtering on your own.

	regex_filters = ['_regular', '_japanese', '_serif']

	df_new = pd.concat([df.filter(regex=regex, axis=1).sum(axis=1).astype(bool) for regex in regex_filters], axis=1)
	mask = df_new.all(axis=1)

	df.loc[mask]

view raw gfonts-2-filter3.py hosted with ❤ by GitHub

How about we make this into a reusable function? This function will take DataFrame and regex filter list as inputs, and outputs a list of font names.

	import re

	def filter_fonts(df, regex_filters):
	df_new = pd.concat([df.filter(regex=re.compile(regex, re.IGNORECASE), axis=1).sum(axis=1).astype(bool) for regex in regex_filters], axis=1)
	mask = df_new.all(axis=1)
	return list(df.loc[mask].family)

	filtered_fontnames = filter_fonts(df, ['Black', 'cyrillic', '_serif'])
	filtered_fontnames

view raw gfonts-2-filter-func.py hosted with ❤ by GitHub

Some Caveats

Before we move on, I want to point out some issues that I found. We are relying on the font data from the CSV annotations that we generated from the Google Fonts Developer API, but there are some discrepancies.

First, the number of fonts present in the annotations is not the same as the number of font files we downloaded.
Some fonts that are not indicated as latin still have latin alphabets.
NotoSansTC is present in the annotation but is not included in the zip archive. Probably because some languages such as Chinese or Korean have a lot of characters and font files get much bigger compared to latin fonts. It’s just my guess…
A font called Ballet — the font file exists but it’s not in the annotation. There may be a few other cases like this.
The name of the folder containing font files is mostly the same as the name of the fonts themselves but there are some exceptions. For example, a font called SignikaNegative is inside a folder called signika.
A font called BlackHanSans is obviously a very thick black weight, but it is listed as regular weight.

So, the annotation is not the complete representation of the font files we downloaded. As a whole, I think it is still a very valuable resource that we have.

Load Font File Path

We will use glob to load the fonts path. If you only want to filter fonts through weights such as Regular, Bold, Thin, simple filename checking will be enough as the weights/styles are already part of the file names, but you won't be able to use other annotations we looked at such as language subsets or category.

	import glob

	# change the root directory according to your setup
	ROOT = '../../input/fonts/ofl'

	# some fonts are within /ofl/fontname/static directory (2 levels deep)
	all_fonts_path = glob.glob(ROOT + '///*.ttf', recursive=True)
	print('number of font files in total: ', len(all_fonts_path))

view raw gfonts-2-all-fonts-path.py hosted with ❤ by GitHub

number of font files in total:  8708

Filter Fonts

It is time to bring different pieces together into a single function that will take a DataFrame object and the path to font files, and then use the filters we define to return a list of filtered font paths.

	def filter_fonts_get_paths(df, root='./', variants=['_'], subsets=['_'], category=''):
	# exceptions
	if not variants or variants == [''] or variants == '': variants = ['_']
	if not subsets or subsets == [''] or subsets == '': subsets = ['_']
	# apply filters
	regex_filters = variants + subsets + ['_'+category]
	df_new = pd.concat([df.filter(regex=re.compile(regex, re.IGNORECASE), axis=1).sum(axis=1).astype(bool) for regex in regex_filters], axis=1)
	mask = df_new.all(axis=1)
	filtered_fontnames = list(df.loc[mask].family)

	# construct file paths
	paths = []
	for fontname in filtered_fontnames:
	if variants == ['_']: # select all variants
	sel = glob.glob(f'{root}/{fontname.lower()}/*/.ttf', recursive=True)
	paths.extend(sel)
	else:
	for variant in variants:
	sel = glob.glob(f'{root}/{fontname.lower()}/**/{fontname}-{variant}.ttf', recursive=True)
	for path in sel:
	paths.append(path)
	print(f'Found {len(paths)} font files.')
	return paths

	# let's try the function
	paths = filter_fonts_get_paths(df, root=ROOT, subsets=['latin'], variants=['thinitalic'], category='sans-serif')

view raw gfonts-2-filter-fonts-get-paths.py hosted with ❤ by GitHub

To describe what the function is doing — What we want to do is to look at both filtered_fontnames and its corresponding file path and check if the file exists. If so, the function returns a path to the file.

From the font names, we can retrieve the file names so that we can load each font and later generate images. We filter fonts with weights, styles, subsets, and categories, but the file names only contain font names and weights/styles.

A few things to note here:

Depending on your OS, you may need to give case-sensitive filters. For example, Bold, LightItalic may work but bold, lightitalic may not. I am using a Mac, and it is case-insensitive. You also cannot combine mutually exclusive conditions together, such as filtering sans-serif and serif at the same time. You can, however, input a regular expression to create a more complex filter.

Some variable fonts are not selected (ex. CrimsonPro, HeptaSlab) because they have different naming conventions than others. I just ignored them because they are not that many, and the variable fonts that have static versions are searchable through the filters.

Preview Font Images

It is finally time to see the fonts displayed as images.

	import random

	IMG_HEIGHT = 512
	IMG_WIDTH = 512

	paths = filter_fonts_get_paths(df, root=ROOT, subsets=[''], variants=['bold'], category='')
	r = random.randrange(0, len(paths))

	# sample text and font
	text = "G"
	text_size = 400
	x = IMG_WIDTH/2
	y = IMG_HEIGHT*3/4
	font = ImageFont.truetype(paths[r], text_size)

	# get text info (not being used but may be useful)
	text_width, text_height = font.getsize(text)
	left, top, right, bottom = font.getbbox(text)
	print('text w & h: ', text_width, text_height)
	print(left, top, right, bottom)

	# create a blank canvas with extra space between lines
	canvas = Image.new('RGB', (IMG_WIDTH, IMG_HEIGHT), "black")

	# draw the text onto the text canvas
	draw = ImageDraw.Draw(canvas)
	draw.text((x, y), text, 'white', font, anchor='ms')

	plt.imshow(canvas)

view raw gfonts-2-preview.py hosted with ❤ by GitHub

Everything Pieced Together

Here is the code block that pieces everything together. This time, all the images will be saved as PNG files.

	from IPython.display import clear_output

	# display images
	monitor = False

	# font filtering
	df = pd.read_csv('./google-fonts-annotaion.csv')
	paths = filter_fonts_get_paths(df, root=ROOT, subsets=['latin'], variants=['bold'], category='serif')

	# text setup
	text = "G"
	x = IMG_WIDTH/2
	y = IMG_HEIGHT*3/4
	text_size = 400

	# create save directory
	target_dir = f'./fonts-images/{text}'
	!mkdir -p $target_dir

	# drawing setup
	canvas = Image.new('RGB', (IMG_WIDTH, IMG_HEIGHT), "black")
	draw = ImageDraw.Draw(canvas)

	for i, path in enumerate(paths):
	font = ImageFont.truetype(path, text_size)
	draw.rectangle((0,0,IMG_WIDTH,IMG_HEIGHT),fill='black')
	draw.text((x, y), text, 'white', font, anchor='ms')
	canvas.save(f'{target_dir}/{i:05}.png', "PNG")

	if monitor:
	clear_output(wait=True)
	plt.imshow(canvas)
	plt.show()
	if i % 200 == 0:
	print(f'exporting {i:05}th of {len(paths)}')

	print('Exporting completed.')

view raw gfonts-2-everything.py hosted with ❤ by GitHub

Conclusion

Now, you can use the generated images as inputs to Convolutional Neural Networks such as Generative Adversarial Networks or Autoencoders. That is what I am going to be playing the images with. I am sure you can find examples of the model architectures pretty easily on Medium and elsewhere.

I hope you found my post useful — please let me know if you make some fun stuff with the fonts!

If you liked my contents, please consider supporting. Thank you!

Tezos donation address:

tz1cr64U6Ga4skQ4oWYqchFLGgT6mmkyQZe9

Kofi donation link:

Erratic Generator

Creative Coding and Design