Scraping AO3 Tagset Changes
Recently, people on a fanfic exchange Discord mentioned it would be nice to see which tags were most recently added to the Bulletproof AO3 tagset. Let’s go over a hacky way to do this.
I looked at the tagset’s webpage (warning for NSFW text and dark themes) and guessed that it would be easy to get the list of tags from it. That was the only part that might have been hard1, so I decided to get stuck in.
The hacky way
I used the command line, aka bash. I should note that for any serious scraping you should use a normal programming language and maybe a library that can help you with getting data off the site, like, I don’t know, AO3.js or this Python AO3 package. This would also help you to host your project on a web page later.
However, I want to demonstrate that you can quickly do something that’s basically okay with bash.
The task
We need to…
- Download a webpage’s HTML on the command line
- Search the HTML for the list of tags
- Put each tag on its own line
- Save the formatted list of tags as a file
- Compare two files to see which lines they do/don’t have in common
- Save the comparison to a file
- Do this every hour (or 6 hours, etc.)
Knowing how to break the task down like this is the part that requires a bit of experience. But hey, I got all my experience from scraping AO3 and Tumblr and other hobby stuff.
In the next parts, I will use a bunch of unfamiliar words which are very googleable and stackoverflow-able.
Getting all tags
Firstly, I go to the webpage, press Ctrl+Shift+I and use the picker to click on a tag in the list. I see that all tags are listed in the HTML as <li>Some text WITHOUT angle brackets</li>, and I Ctrl+F <li> to be reasonably sure that nothing else is listed in the same way.
I open my terminal. Let’s download the page HTML with the curl command. Then let’s use the grep command and a regex, <li>[^<]*</li> (where [^<]* means “any run of characters that aren’t <”), to reduce it down to anything matching <li>Some text WITHOUT angle brackets</li>.
We send the HTML from curl to the grep command with a pipe |:
curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>'
The result gets automatically printed out (after a thing that tells you about curl’s download progress):
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0<li>Graphic Depictions Of Violence</li>
<li>Major Character Death</li>
<li>No Archive Warnings Apply</li>
(etc...)
Looks good, we printed all the tags out this way.
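Side note: that progress table shows up because curl writes its progress meter to stderr, which isn’t part of the pipe. If it bothers you, curl’s -s (silent) flag turns it off:
curl -s https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>'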
Following the same principle, we can get each tag on its own line, with no blank lines, sorted, using the commands sed (for search and replace) and sort (for sorting). This will be helpful for making comparisons.
curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sort
Result:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 102k 0 102k 0 0 19557 0 --:--:-- 0:00:05 --:--:-- 25280
A's cat claims B before A knows A loves B
A's Love Language Is Killing B's Enemies And Dropping Them At B's Feet Like A Cat Gifting Dead Mice
(etc...)
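If the sed part looks cryptic: the first sed replaces every opening or closing li tag with a newline, and the second one deletes the blank lines that leaves behind. You can see it on a single made-up line (this assumes GNU sed, which is what understands -r and a \n in the replacement):
echo '<li>Epistolary</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d'
which prints just Epistolary.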
Did I know how to do this correctly at first? No, I just googled what I wanted to do. “how to remove all blank lines in bash” and stuff like that. Someone else has always done it first.
Comparing old and new versions of the tagset
We’ve been printing this stuff out, but we can save it to a file instead by adding > filename at the end. I saved mine to a file called fake_older_tagset, opened it in Notepad and added and removed entries so I could test if comparisons worked.
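Spelled out, that’s just the earlier pipeline with the redirect tacked onto the end:
curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sort > fake_older_tagset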
I googled and found a command that would do the comparing for me, called comm.
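To get a feel for comm before pointing it at the real tag lists, here is a tiny made-up example (the file names are just for illustration; comm wants both inputs sorted, and the -1/-2/-3 flags suppress the columns for lines only in the first file, lines only in the second file, and lines in both, respectively):
printf 'apple\nbanana\n' > before
printf 'banana\ncherry\n' > after
comm -23 after before
comm -13 after before
The first comm prints cherry (only in after, i.e. added) and the second prints apple (only in before, i.e. removed).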
Also I decided to move to a bash script so I could reuse things without getting confused. In Notepad, I made a file called scrape.sh.
Inside, I write something that will store a timestamped filename in a shell variable myfilename, using date to generate the timestamp. Then I save the scraped tag list to myfilename. Then I use this file twice with comm.
#!/bin/bash
# build a timestamped filename of the form scraped_YYYY-MM-DD_HH-MM-SS
myfilename=scraped_$(date +'%Y-%m-%d_%H-%M-%S')
# scrape the tagset and save the sorted tag list to that file
curl https://archiveofourown.org/tag_sets/18043 | grep '<li>[^<]*</li>' | sed -r 's#</?li>#\n#g' | sed '/^ *$/d' | sort > "$myfilename"
echo "Added:"
comm -23 "$myfilename" fake_older_tagset    # lines only in the new scrape
echo "Removed:"
comm -13 "$myfilename" fake_older_tagset    # lines only in the old file
Back on the command line, I run chmod +x scrape.sh to make scrape.sh executable. Now I can re-run this sequence of commands any time by entering ./scrape.sh in the terminal. Let’s see the result:
./scrape.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 102k 0 102k 0 0 25213 0 --:--:-- 0:00:04 --:--:-- 25219
Added:
Epistolary
Removed:
My Fake New Tag
We did it! Now, assuming we have a copy of the taglist from 6pm and the taglist gets updated at 6:01pm, running this script any time after 6:01pm will get us a list of all added and removed tags…
Turning this into a program which automatically runs on the hour and collects all the added and removed tags is left as an exercise to the reader, who should now be convinced of the power of googling. Feel free to message me about it or request a part 2.
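If you want a head start on that exercise, one minimal sketch (assuming a Linux or Mac machine where cron is available, and a made-up path to the script) is to run crontab -e and add a line like:
0 * * * * /home/you/tagset/scrape.sh >> /home/you/tagset/scrape.log 2>&1
That runs the script at the top of every hour and appends its output to a log file; 0 */6 * * * would be every 6 hours instead. The script itself would also need a small tweak so each run compares against the most recent previous scrape rather than fake_older_tagset.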
1. Because I wasn’t going to use anything more complicated than regexes, and technically regular expressions are not powerful enough to parse HTML. But as usual the HTML was simple enough. ↩︎