I stumbled upon the hashtag 20books on Mastodon where people were posting the books that influenced them the most. The idea behind the hashtag can be seen in the first toot that I could find:
This “challenge” to post one book per day was running for quite a while with numerous participants and I was wondering if I had missed cool books, that I should add to my reading list. And here the idea was born to somehow list all books in an searchable way. The problem is, there are only pictures of covers and no metadata in the posts.
How the data extraction was done
After some attempts with Optical Character Recognition or by use of the ALT-text of the pictures, that went nowhere, I though this could well be my first real world task for GenAI. Extracting information out of pictures and combing it with the knowledge of the internet seemed quite a good fit for this tool.
In short words I downloaded the preview pictures of all toots that included the hashtag. That was not hard to do as most mastodon instances have an REST API that can be question to return toots with a certain hashtag. From the result I extracted the link to the picture.
A compression was needed to reduce the size for the next step that is to ask ChatGPT to analyze the picture. I asked this LLM to return the author, title, year of first publishing, genre and sub-genre by utilizing the OpenAI API to place the query for all pictures.
Then finally looked for identical entries and counted them.
Limitations
This process comes with some limitations, most of them are connected with the use of GenAI
- WRONG IDENTIFICATION
I check with examples, and the results a really good, still there is no guarantee.
- WRONG METADATA
Especially the year needed manual correction and i possibly did not find all error
- WRONG SYNTAX
Wrong sequence (e.g. author and title switched), missing comma, missing values etc.
- BOOK POSTED IN MULTIPLE LANGUAGES
I did not correct for that, so a book listed in multiple language will be counted as multiple books
I still find the results conclusive, even though I know it’s not 100% correct.
The Results
In total I analyzed 6682 toots and was able to extract 5904 pictures of book covers.
The oldest one is from 20 MAY 2024 and I stopped collecting at 31 AUG 2024.
Let’s look at the extracted titles first. The TOP 20 is a list of common books, but the count of each book already implies, that we have a wide distribution! Actual most books were posted only once (92%), it’s a really long tail distribution. This means, that actually mostly uncommon books are uploaded. The people posting under the hashtag are less mainstream that the Top book list suggests.
All the same can be said for the TOP-20 Authors the distribution of those.
We can see some differences between the top books and top authors list. For example No.1 in the title list – J.R.R. Tolkien – is not No. 1 in the book list. The reason is he wrote more that one book scoring high. With Douglas Adams it is different. The Hitchhiker’s Guide to the Galaxy was almost the only title mentioned, but in different languages. As I did not correct for that, they are counted as separate books, see limitations.
Looking at the years of first publishing, we see a wide range with a strong cluster is modern books.
The oldest book I found was Homers Ilias, estimated to be published in 750 BCE.
Some books were placed at estimated values around the year 0.
But more recent books are clearly #20books mastodoners favorites with a high plateau between 1960 and 2000.
I could not make anything out of the genres or sub-genres.
All data
If you want to got though the list and search for genres or authors etc., you can download the extracted data here:
https://github.com/HenningVajen/20books_mastodon
It’s all .csv files, that can be easily imported to spreadsheet editors like Excel or Numbers
Feel free to use it. If you happen to make something out of the data, please give me a ping. Maybe you can use the identified genres.