Posts

Tweets to @realdonaldtrump; How many fucks are there to give?

I’ve been collecting tweets to @realDonaldTrump since June 2017. The last time I pulled the dataset together and deduped it, I asked myself, “I wonder how many occurrences of ‘fuck’ are in the dataset.” Or, how many fucks are there to give? Well… The data is updated by running a query on the Standard Search API every five days.

$ twarc search 'to:realdonaldtrump' --log donaldsearch$DATE.log > donaldsearch$DATE.jsonl

Which yields something like this every five days.
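As a rough illustration of the counting question, here is a minimal sketch. It is not the method used in the post. It assumes line-oriented tweet JSON of the kind twarc writes, with the tweet body in `full_text` (extended mode) or `text`. The helper names `tweet_text` and `count_term` are hypothetical.

```python
import json
import re
from typing import Iterable


def tweet_text(tweet: dict) -> str:
    """Return the tweet body, preferring the untruncated full_text field."""
    return tweet.get("full_text") or tweet.get("text") or ""


def count_term(lines: Iterable[str], term: str) -> int:
    """Count case-insensitive occurrences of term across tweet JSON lines.

    Assumption: each non-blank line is one JSON-encoded tweet.
    Substring matches count too, so longer words containing the term
    are included.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        total += len(pattern.findall(tweet_text(json.loads(line))))
    return total
```

Because `count_term` accepts any iterable of lines, an open file handle for one of the `.jsonl` files can be passed in directly, and the file is read one tweet at a time rather than loaded whole.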

Thumbnails in Warclight

One feature of Blacklight that I’ve always wanted to set up in Warclight is displaying thumbnails in the results display. Getting this set up is a bit tricky. But, since Warclight is standardizing metadata on webarchive-discovery’s Solr schema.xml, we can avail ourselves of a number of fields for a potential implementation. The url field is the obvious choice, but the problem is that Blacklight out of the box will try to display a thumbnail for every url field value you give to config.

Twitter Wordcloud Pipeline

At this past week’s Archives Unleashed datathon, I jokingly created some wordclouds of my Co-PI’s timelines.

Finished my most likely bigly winning #hackarchives project: A Word Cloud of @lintool's timeline! https://t.co/eK2KPGjaGo - nick ruest (@ruebot) April 27, 2018

Or, @ianmilligan1 #HackArchives https://t.co/qMxiet0osl - nick ruest (@ruebot) April 27, 2018

Mat Kelly asked about the process this morning, so here is a little how-to of the pipeline: Requirements: twarc, jq, wordcloud_cli.
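The post's own pipeline uses twarc, jq, and wordcloud_cli, and its exact commands are not shown in this excerpt. As a stand-in for the text-extraction step only, here is a minimal Python sketch that turns a timeline's line-oriented tweet JSON into a plain-text corpus. The names `timeline_text` and `build_corpus` are hypothetical, as is the choice to strip links.

```python
import json
import re
from typing import Iterable, Iterator

# Links add noise to a word cloud, so this sketch drops them (an assumption,
# not something the original pipeline is known to do).
URL_PATTERN = re.compile(r"https?://\S+")


def timeline_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the cleaned text of each tweet in line-oriented tweet JSON."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        text = tweet.get("full_text") or tweet.get("text") or ""
        cleaned = URL_PATTERN.sub("", text).strip()
        if cleaned:
            yield cleaned


def build_corpus(lines: Iterable[str]) -> str:
    """Join all tweet texts into one newline-separated corpus string."""
    return "\n".join(timeline_text(lines))
```

The resulting string can be written to a text file and handed to a word cloud generator such as wordcloud_cli.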

The world is a beautiful and terrible place

This is the text for my presentation at the “National Forum on Ethics and Archiving the Web”. I had the honour of being on an Archiving Trauma panel with some great people. Michael Connor, Chido Muchemwa, Coral Salomón, Tonia Sutherland, and Lauren Work, thank you for sharing your stories! The world is a beautiful and terrible place. Twitter can be beautiful. Twitter is fucking awful. So, capturing traumatic events on Twitter.

A Quick Benchmark of Webarchive-Discovery

This past week Compute Canada provided us with resources to set up our Solr Cloud instance for WALK and Archives Unleashed. We were able to get things set up relatively quickly thanks to a bit of preparation and practice on our local machines in the previous weeks. Once everything was set up (5 virtual machines total: 4 Solr Cloud nodes and one indexer; details below), we started benchmarking webarchive-discovery and our Solr Cloud setup with GNU Parallel.

See a Little Warclight

What if you have a few terabytes of web archive data sitting around, and want to shine a little light into them? Well, the good news is that now you can! The British Library’s UK Web Archive initiative has created some great software over the last couple of years that lets you index your web archive content into Solr and provide access to it in a discovery interface called Shine. You can check Shine out in action here (for the British Library’s collections) or here (for our Canadian politics one).

The Archives Unleashed Project: Warcbase is dead, long live the Toolkit

by Ian Milligan, Jimmy Lin, and Nick Ruest

We were delighted to be able to announce a few months ago that our project team at the University of Waterloo and York University was awarded a grant from the Andrew W. Mellon Foundation to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Since that announcement, we’ve been busy at work on a few different things: modernizing and updating our Warcbase web archiving analytics platform, working on a discovery interface and underlying infrastructure, and laying the administrative groundwork for the project itself.

Twitter Bots

Introduction

List of bots I run, divided up by type.

anon

diffengine

YUDLbots

DPLA bots

Other

Twitter Datasets and Derivative data

Tweets to Donald Trump (@realDonaldTrump)

59,261,490 tweet ids for tweets directed at Donald Trump (@realDonaldTrump), collected with Documenting the Now’s twarc. Tweets can be “rehydrated” with Documenting the Now’s twarc, or Hydrator.

twarc hydrate to_realdonaldtrump_ids.txt > to_donaltrump.jsonl

Tweets from May 7, 2017 to June 21, 2017 were collected with a combination of the Filter (Streaming) API and the Search API. The Filter API failed on June 21, 2017. From June 23, 2017 forward, only the Search API was used to collect.

14,478,518 WomensMarch tweets January 12-28, 2017

Overview

A couple of Saturday mornings ago, I was on the couch listening to records and reading a book when Christina Harlow and MJ Suhonos asked me about collecting #WomensMarch tweets. Little did I know at the time that #WomensMarch would be the largest-volume collection I have ever seen. By the time I stopped collecting a week later, we’d amassed 14,478,518 unique tweet ids from 3,582,495 unique users, and at one point hit around 1 million tweets in a single hour.
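Figures like "unique tweet ids" and "unique users" can be derived from the collected JSON in a few lines. The following is a minimal sketch, not the procedure used for this dataset. It assumes line-oriented tweet JSON in which each tweet carries `id_str` and a `user` object with its own `id_str`. The function name `unique_counts` is hypothetical.

```python
import json
from typing import Iterable, Tuple


def unique_counts(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (unique tweet ids, unique user ids) for tweet JSON lines.

    Duplicate tweets, which overlapping searches can produce, are
    collapsed because the ids are stored in sets.
    """
    tweet_ids = set()
    user_ids = set()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        tweet = json.loads(line)
        tweet_ids.add(tweet["id_str"])
        user_ids.add(tweet["user"]["id_str"])
    return len(tweet_ids), len(user_ids)
```

This keeps every id in memory, which is simple but may not suit every machine at the scale described here. Sorting and deduplicating the ids on disk is a common alternative.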