Scraping AO3 Tagset Changes

Recently, people on a fanfic exchange Discord mentioned it would be nice to see which tags were most recently added to the Bulletproof AO3 tagset. Let’s go over a hacky way to do this.

I looked at the tagset’s webpage (warning for NSFW text and dark themes) and guessed that it would be easy to get the list of tags from it. That was the only part that might have been hard¹, so I decided to get stuck in.

The hacky way

I used the command line, aka bash. I should note that for any serious scraping you should use a normal programming language and maybe a library that can help you get data off the site, like, I don’t know, AO3.js or this Python AO3 package. This would also make it easier to host your project on a web page later.

However, I want to demonstrate that you can quickly do something that’s basically okay with bash.

The task

We need to…

1. download the tagset webpage,
2. extract just the tags from the HTML,
3. save them to a file, and
4. compare that file against an older saved copy to see what was added and removed.

Knowing how to break the task down like this is the part that requires a bit of experience. But hey, I got all my experience from scraping AO3 and Tumblr and other hobby stuff.

In the next parts, I will use a bunch of unfamiliar words, all of which are very googleable and Stack Overflow-able.

Getting all tags

First, I go to the webpage, press Ctrl+Shift+I to open the browser’s developer tools, and use the element picker to click on a tag in the list. I see that all tags are listed in the HTML as <li>Some text WITHOUT angle brackets</li>, and I Ctrl+F for <li> to be reasonably sure that nothing else on the page is marked up the same way.

I open my terminal. Let’s download the page HTML with the curl command. Then let’s use the grep command and a regex to reduce it down to anything matching <li>Some text WITHOUT angle brackets</li>.

We send the HTML from curl to the grep command with a pipe (|):

curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>'

The result gets printed out automatically (after a table that tells you about curl’s download progress):

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0<li>Graphic Depictions Of Violence</li>
<li>Major Character Death</li>
<li>No Archive Warnings Apply</li>
(etc...)

Looks good, we printed all the tags out this way.
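Two asides before moving on. First, that progress table comes from curl itself; if you don’t want it, curl’s -s (silent) flag turns it off. Second, the Ctrl+F check from earlier can be repeated in the terminal: grep -o prints every match on its own line and wc -l counts lines, so if the two counts below disagree, something on the page uses <li> in a way the pattern doesn’t catch. (Like the original pattern, this only spots bare <li> without attributes.)

curl -s https://archiveofourown.org/tag_sets/18043 | grep -o '<li>' | wc -l
curl -s https://archiveofourown.org/tag_sets/18043 | grep -o '<li>[^<]*</li>' | wc -l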

Following the same principle, we can get each tag on its own line, with no blank lines, sorted, using the commands sed (for search and replace) and sort (for sorting). This will be helpful for making comparisons.

curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sort

Result:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  102k    0  102k    0     0  19557      0 --:--:--  0:00:05 --:--:-- 25280
A&#39;s cat claims B before A knows A loves B
A&#39;s Love Language Is Killing B&#39;s Enemies And Dropping Them At B&#39;s Feet Like A Cat Gifting Dead Mice
(etc...)

Did I know how to do this correctly at first? No, I just googled what I wanted to do: “how to remove all blank lines in bash” and stuff like that. Someone else has always done it first.
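Two notes on that pipeline. If you’re on macOS, the built-in BSD sed doesn’t know the -r flag; -E is the equivalent spelling there. And you can see in the output above that AO3 escapes apostrophes as the HTML entity &#39;. A proper scraper would decode entities for real, but for a quick hack you can bolt one more sed onto the pipeline to handle just that one entity:

curl -s https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sed "s/&#39;/'/g" | sort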

Comparing old and new versions of the tagset

We’ve been printing this stuff out, but we can save it to a file instead by adding > filename at the end of the command. I saved mine to a file called fake_older_tagset, opened it in Notepad, and added and removed entries so I could test whether comparisons worked.
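For the record, producing that file is the same pipeline with the redirect stuck on the end (this was before I went in and messed with the entries by hand):

curl -s https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sort > fake_older_tagset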

I googled and found a command that would do the comparing for me, called comm.
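Here’s the gist of comm, with two throwaway files so you can see what the flags do. comm expects both files to be sorted (which is why we added sort earlier) and prints three columns: lines only in the first file, lines only in the second, and lines in both. The digits in -23 and -13 say which columns to suppress.

printf 'apple\nbanana\n' > old_list
printf 'banana\ncherry\n' > new_list
comm -23 new_list old_list    # only in new_list: cherry
comm -13 new_list old_list    # only in old_list: apple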

I also decided to move to a bash script so I could reuse things without getting confused. In Notepad, I made a file called scrape.sh.

Inside, I write something that stores a timestamped filename in a shell variable, myfilename, using date to generate the timestamp. Then I save the scraped tag list to that file, and use the file twice with comm. (The #!/bin/bash line at the top tells the system to run the file with bash.)

#!/bin/bash

# Name the output file after the current date and time
myfilename=scraped_$(date +'%Y-%m-%d_%H-%M-%S')

# The same scrape-and-clean pipeline as before, saved to the file
curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sort > "$myfilename"

# -23 keeps lines only in the first file (new tags),
# -13 keeps lines only in the second file (removed tags)
echo "Added:"
comm -23 "$myfilename" fake_older_tagset
echo "Removed:"
comm -13 "$myfilename" fake_older_tagset

Back on the command line, I run chmod +x scrape.sh to make scrape.sh executable. Now I can re-run this sequence of commands any time by entering ./scrape.sh into the terminal. Let’s see the result:

./scrape.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  102k    0  102k    0     0  25213      0 --:--:--  0:00:04 --:--:-- 25219
Added:
Epistolary
Removed:
My Fake New Tag

We did it! Now, assuming we have a copy of the taglist from 6pm and the taglist gets updated at 6:01pm, running this script any time after 6:01pm will get us a list of all the added and removed tags…

Turning this into a program which automatically runs on the hour and collects all the added and removed tags is left as an exercise to the reader, who should now be convinced of the power of googling. Feel free to message me about it or request a part 2.
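If you want a nudge toward that exercise: on Linux (and macOS), cron is the classic way to run something on a schedule. A crontab entry like the one below would run the script at the top of every hour. The path is just a placeholder for wherever you keep scrape.sh, and you’d want the script to write its files to an absolute path rather than the current directory.

0 * * * * /home/you/scrape.sh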


  1. Because I wasn’t going to use anything more complicated than regexes, and technically regular expressions are not powerful enough to parse HTML. But as usual, the HTML was simple enough.