The SLNSW Social Media Archive is a collaboration between the State Library of New South Wales and CSIRO Data61. It is an archive of public Social Media content discussing life in New South Wales.
In the same way that you can go into a library today and look through newspapers and other documents from the early twentieth century for insights into life back then, the SLNSW Social Media Archive seeks to create a collection of data that will allow people in the future to gain insights into life as it is now.
The user interface was designed by Brian Jin, Stephen Wan, James McHugh and Cecile Paris at CSIRO Data61, in collaboration with Brendan Somes, Geoffrey Barker and Sean Volke from the State Library of NSW.
Data collection for the Social Media Archive is largely based on text queries, augmented with subscriptions to selected government- or community-run accounts of public interest.
The queries and subscriptions are curated by SLNSW staff.
The search queries and subscriptions are organised into Activities, each representing a different aspect of life in New South Wales.
|ARTS||Posts related to the Arts.|
|BUSINESS||Business in New South Wales.|
|EDUCATION||New South Wales education system.|
|ENVIRONMENT||New South Wales environmental discussion.|
|GOVERNMENT||Government in New South Wales.|
|INDIGENOUS||Posts related to Indigenous issues and culture.|
|LEISURE||Leisure Activity in New South Wales.|
|MEDIA||New South Wales media.|
|POLITICS||New South Wales politics.|
|SPORT AND RECREATION||Sporting Clubs and Events.|
After collection, Social Media content is classified by type of media.
Mediatypes place content into broad categories according to the kind of media it is.
The current mediatype categories are:
|microblog||Twitter, Google+: Posts on microblogging websites.|
|public pages||Public pages, comments and other content.|
|picture||Instagram and other picture sharing platforms.|
|news||News media websites.|
|video||YouTube and other video sharing websites.|
|blog||Blogspot, WordPress and other blogging websites.|
|forum||Posts and threads on Internet Discussion forums.|
|question||Yahoo Answers and other Question Answering websites.|
|misc||Content that we could not determine a type for.|
Only a very small amount of Social Media content includes unambiguous geolocation information.
On Twitter and Instagram, users can elect to include their latitude and longitude when tweeting or uploading an image, respectively.
The Emotion Detection system used in the archive was originally developed for the We Feel project. It uses a large vocabulary of emotion terms that were compiled from multiple sources, including the ANEW and LIWC corpora, and a list of moods from LiveJournal. We conducted a crowdsourcing task (using Crowdflower) to organise these terms against Parrott's hierarchy of emotions.
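A vocabulary-based approach of this kind can be illustrated with a minimal sketch. The tiny lexicon below is hypothetical and stands in for the much larger crowdsourced vocabulary; the function name `detect_emotions` and the simple word matching are assumptions for illustration, not the We Feel implementation. Each term maps to one of the six primary emotions in Parrott's hierarchy.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: emotion term -> primary emotion (Parrott).
# The real vocabulary is far larger and was organised by crowdsourcing.
EMOTION_LEXICON = {
    "happy": "joy",
    "delighted": "joy",
    "furious": "anger",
    "annoyed": "anger",
    "scared": "fear",
    "worried": "fear",
    "miserable": "sadness",
    "adore": "love",
    "amazed": "surprise",
}


def detect_emotions(text: str) -> Counter:
    """Count primary emotions whose terms appear in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(EMOTION_LEXICON[w] for w in words if w in EMOTION_LEXICON)


counts = detect_emotions("So happy and delighted, but a bit worried!")
print(counts["joy"], counts["fear"])
```

A lookup like this ignores negation and context ("not happy" still counts as joy), which is one reason aggregate trends are more meaningful than any single classification.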
Keywords are selected for each item of Social Media Content based on the frequency of words across the Archive. Hashtags are filtered out of the keyword selection process so they do not swamp the keyword results.
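As an illustration only, frequency-based keyword selection could look like the sketch below. It assumes that words which are rare across the Archive are more informative for an individual item, and scores them with a smoothed inverse document frequency; the actual scoring used by the Archive is not described here, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"#?\w+")


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens, with hashtags dropped before scoring."""
    return [t for t in TOKEN.findall(text.lower()) if not t.startswith("#")]


def document_frequencies(posts: list[str]) -> Counter:
    """Number of posts in which each word appears at least once."""
    df = Counter()
    for post in posts:
        df.update(set(tokenize(post)))
    return df


def select_keywords(post: str, df: Counter, n_posts: int, k: int = 3) -> list[str]:
    """Top-k words of a post, ranked by count times smoothed IDF (an assumption)."""
    counts = Counter(tokenize(post))
    scores = {
        word: count * math.log((1 + n_posts) / (1 + df[word]))
        for word, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


posts = [
    "the ferry to manly #sydney",
    "the train to newcastle",
    "the ferry was late",
]
df = document_frequencies(posts)
print(select_keywords(posts[0], df, len(posts), k=2))
```

Dropping hashtags in `tokenize` mirrors the filtering described above: a hashtag that appears in only a few items would otherwise always outrank ordinary words.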
Hashtags are much simpler to extract than Emotions or Keywords. To extract hashtags, the system simply looks for words beginning with a number sign or hash character (#). As noted in the Keyword discussion, hashtags are filtered out of Social Media content text when selecting keywords.
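The rule described above can be sketched in a few lines. The regular expression and the function names are assumptions made for illustration; the Archive's exact rules (for example, which characters may appear in a hashtag) are not specified here.

```python
import re

# Assumed rule: a '#' that starts a word, followed by word characters.
# A '#' in the middle of a word (e.g. "abc#def") is not treated as a hashtag.
HASHTAG_PATTERN = re.compile(r"(?<!\w)#(\w+)")


def extract_hashtags(text: str) -> list[str]:
    """Return the hashtags in the text, lowercased, in order of appearance."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]


def strip_hashtags(text: str) -> str:
    """Remove hashtags so they do not take part in keyword selection."""
    return " ".join(HASHTAG_PATTERN.sub(" ", text).split())


post = "Sunny day at #Bondi with friends #Sydney"
print(extract_hashtags(post))
print(strip_hashtags(post))
```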
The SLNSW Social Media Archive is designed to only show aggregate data. It allows users to view trends in discussion in Social Media but does not support viewing content at the level of individual posts.
Aggregate figures are only produced when the number of Social Media items matching a set of conditions is above a threshold. If the number of matching items is below this threshold, no aggregate counts are provided.
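The thresholding behaviour can be sketched as follows. The threshold value of 10, the treatment of a count exactly equal to the threshold, and the function name are all assumptions for illustration; the Archive's actual threshold is not stated.

```python
from collections import Counter
from typing import Optional

MIN_ITEMS = 10  # hypothetical value; the real threshold is not published here


def aggregate_counts(
    labels: list[str], min_items: int = MIN_ITEMS
) -> Optional[dict[str, int]]:
    """Return per-label counts, or None when too few items match.

    Assumption: a result set with exactly min_items items is allowed.
    """
    if len(labels) < min_items:
        return None  # suppress small result sets entirely
    return dict(Counter(labels))


print(aggregate_counts(["joy"] * 3))                  # too few items
print(aggregate_counts(["joy"] * 7 + ["anger"] * 5))  # enough items
```

Returning nothing at all, rather than a small count, means that narrowing a query cannot be used to isolate individual posts.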
The aggregate information displayed in the Social Media Archive website is also available via API for researchers.
To access the API you need to register for an account. To do this, you will need to provide an email address that we can associate with your requests to the API.
After registering, you will be provided with an access token. This token enables you to log in to the API.
Your account is limited to 100 requests every 15 minutes. This enables us to provide good, consistent service to all our visitors and API users. If you exceed the request quota, your account will be rate limited.
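A client can stay within the quota by tracking its own requests. The sketch below is a client-side helper only: the class name `RequestBudget` is hypothetical, and it assumes a sliding 15-minute window, which may differ from how the server counts requests.

```python
import time
from collections import deque


class RequestBudget:
    """Client-side tracker for a quota of max_requests per window_seconds."""

    def __init__(self, max_requests=100, window_seconds=15 * 60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.sent = deque()  # timestamps of requests still inside the window

    def _expire(self, now):
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()

    def try_acquire(self) -> bool:
        """Record a request and return True if the quota allows it."""
        now = self.clock()
        self._expire(now)
        if len(self.sent) >= self.max_requests:
            return False
        self.sent.append(now)
        return True

    def seconds_until_available(self) -> float:
        """How long to wait before the next request would be allowed."""
        now = self.clock()
        self._expire(now)
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.sent[0])


budget = RequestBudget()
if budget.try_acquire():
    print("OK to send the next API request")
else:
    print("Wait", budget.seconds_until_available(), "seconds")
```

Call `try_acquire()` before each API request and pause for `seconds_until_available()` when it returns `False`.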
Content collection and processing initially happens on dedicated Vizie instances.
Vizie is our Social Media monitoring platform. It allows monitors to specify search queries and public Social Media accounts to subscribe to, and then watches Social Media for relevant content.
Content collected in Vizie is piped through to the Social Media Archive using Kafka.
The search and filtering capabilities of the Explorer interface are provided through Elasticsearch.
The SLNSW Social Media Archive website was developed using a variety of tools including:
Visualisations were created using the following tools:
API documentation was generated using: