Recently, I have started to highlight text in my Kobo ereader as I go through a book. Once I've created a few bookmarks, I would like to somehow import them to my computer, ideally in a format such as markdown (since I like to use Obsidian to organise my notes), so I built a small program do it.

My ereader bookmarks

To be more specific about what I wanted, my erader has the option of selecting parts of the text in a book and highlighting it. This highlight is then saved and, in my ereader, I can check all the text that I have highlighted for a given book. It's also possible to add annotations to the highlighted text. This annotations are free text that is linked to the highlight and that can be access from the erader menu as well.

The problem is that I couldn't find a way of taking these bookmarks (the highlighted text and the potential annotations) out of the ereader and into my computer. Ideally, I would have some program that I could run on my computer when my ereader is connected; this program would be able to understand the bookmarks in the ereader, get both the highlights and the annotations, and understand from which book they are coming from. Then it would group them in a convenient format.

Designing the application

Since I could not find anything close to what I wanted, I decided to build my own application. It could be a CLI application written in Python using Typer library. I had used Typer before and I knew it works very well for this kind of small CLI utilities. I also could used Rich to have a somewhat nice interface. I have been wanting to try it out for some time and this seems like a good project to give it a go.

The other thing I needed was a way of retrieving the necessary information from my ereader, this is, I needed a way for my program to obtain a list of the bookmarks in my ereader, the highlighted text, and the potential annotations. It also needed to know to which book each bookmark corresponded to (the book title and author). I started looking and I found a few website like this one that explain that you can easily export your bookmarks as a .txt file by changing the configuration of the ereader. It would be easy to take this .txt file, parse it to extract the information I wanted, and format it into something like markdown.

But as I looked into this in more detailed, it was clear that it was not a good approach. For once, in order to create the .txt files, I would need to select each book in my ereader and tell it to export the highlights before my computer could retrieve them. This is not a major problem, but a minor annoyance. The main issue was that the file containing the bookmarks didn't appear to have a proper structure. All bookmarks from the same book would be added to a single file without any separator between them beyond a few empty lines (the number of empty lines wasn't consistent either). What's more, if the bookmark had annotations, the annotation text would mix with the highlighted text. Because all of this, parsing the contents of the .txt file to retrieve the bookmarks in a somewhat general way would be nearly impossible.

When I realised this, I started digging around at the files in my ereader and I found a KoboReader.sqlite file. This file corresponds to a SQLite database and, after inspecting the database, I found out plenty of information. Notably, there were two tables that could be useful for this project. The first one was a Bookmark table that looks like this (here only the relevant columns are shown):

VolumeIDTextAnnotationUUID
<ebook file path><highlighted text><annotated text><universally unique identifier>
............
............

I could easily query this table from my application to obtain the necessary information using a SQLite libary. Since this is a proper database, instead of a simple .txt file, no parsing is necessary. I noticed that, in the Bookmark table, some rows didn't have any data in the Text, nor in the Annotation entries. I deduced that these correspond to bookmarking a page, which is something I can also do on my ereader. I was not interested in this kind of bookmarks, but I should be able to easily ignore them when querying the database.

The other interesting table that I found in KoboReader.sqlite is a Content table that contains information about all the books that are currently stored in the ereader. It looks like this (again, some irrelevant columns have are not being shown):

ContentIDTitleAttributionContentType
<ebook file path><book title><book author><type of entry>
............
............

The column ContentID in the content table represents the same information as the VolumeID in the Bookmark table. The column ContentType is an interesting one. There are two types of entries in the content table, general information about each book and information regarding specific chapters or section of the book, the former have a value of 6 for the ContentType and the later have a 9 (I don't know the reason for this numbering choice).

For the entries that correspond to general information about books, the columns Title and Attribution contain the book title and author respectively.

So the plan was clear, use Typer to build a CLI application that could query the SQLite database, retrieve the bookmarks and book information, and format them nicely in a makdown document. It would also be able to display information about the bookmarks and for that I would use Rich. The application will also need a way of keeping track of the bookmarks that it has already imported; so that it is able to only import new bookmarks when it's called multiple times.

The technical details

So that's exactly what I built. I called the final application Kobo Highlights, and it can be called from the terminal using the command kh. It supports the following subcommands:

  • kh config: It is used to manage the configuration of the program. Basically Kobo Highlights needs to know two things, where my erader is mounted and where I want the final markdown files to be created. These two things are stored in a config file, and the config command can be used to see the current configuration and to create a new one. Kobo Highlights can't work with a proper configuration, so if it is called and it can't find one, it will prompt the user to create one interactively.

  • kh ls: It lists the available bookmarks. By default it will only list new bookmarks, this is, bookmarks that are on the ereader but have not been imported yet. The -all flag can be use to list all the bookmarks in the ereader instead, independently of whether or not they have been imported. The ls command will list, for each bookmark, the highlighted text, the annotation (in case there's one), the book title and author, and the ID. The ID of the bookmark is the UUID assigned to the bookmark by the ereader, which turned out to be a very useful field. The book title and author are queried from the content table. To do so, a query is performed with the right ContentID (the VolumeID value from the bookmark table) and forcing ContentType to be 6.

  • kh import: Finally, the main point of the tool, this command will import a set of bookmarks and store them as a markdown document. The markdown documents will follow a simple structure in which the highlighted text is included as a block quote, and the potential annotations are added bellow. The import command can be called with multiple options; it supports all to import all bookmarks in the ereader and new to only import the ones that weren't imported before, but it also supports being called with a list of IDs, a book title, or a book author.

One thing I had to consider was how to keep track of the bookmarks that have already been imported. Originally I queried the markdown document and looked at the text inside each block quote. This worked ok, but it had some problems.

The main issue is that it's not easy to parse markdown text. This is because markdown is a format designed for human readability, unlike JSON, which is designed mainly for serialization. In order to parse the markdown files, I first convert them to html and then use Beautiful Soup to parse the html text. This works fine in most cases but it's only a matter of time before some estrange edge case breaks it.

The other issue is that, sometimes, I may want to modify the highlighted text. An example of this is when I highlight part of a sentence and I want to add some punctuation or change the capitalization. The current markdown-to-html parser allows to add emphasis to the highlighted text, but, as soon as you modified a single character, it will not recognize the bookmark, and it will re-import it the next time it's called.

In order to solve these issues, I got rid of my markdown parser and instead store the IDs (I told you they are a useful field) of all the bookmarks that have been imported in a hidden JSON file in the same directory as all the markdown files. Then, every time the application needs to know what highlighted have been imported, all it needs to do is load this JSON file.

Another thing to point out is that when querying the ereader database, the application doesn't access the SQLite file directly, instead it creates a local copy of the file in the computer and queries the data from the copy. Once it's done, it deletes the local file. The reason for doing this is that the database doesn't only contain the Bookmarks, but also a lot of data that seems important for the ereader's functionality. Since I'm not that familiar with SQLite, I'd rather not interact with the ereader database directly, but use a copy instead, just in case.

Where to find the program

In case you want to have a look or try Kobo highlights, it is available on pypi and the entire source code is hosted in github.