SLNSW Social Media Archive

What is the SLNSW Social Media Archive?

The SLNSW Social Media Archive is a collaboration between the State Library of New South Wales (SLNSW) and CSIRO Data61. It is an archive of public Social Media content discussing life in New South Wales.

In the same way that you can go into a library today and look through newspapers and other documents from the early twentieth century for insights into life back then, the SLNSW Social Media Archive seeks to create a collection of data that will allow people in the future to gain insights into life today.

The user interface was designed by Brian Jin, Stephen Wan, James McHugh and Cecile Paris at CSIRO Data61, in collaboration with Brendan Somes, Geoffrey Barker and Sean Volke from the State Library of NSW.

Media Coverage

Media Releases

Citation

Publications

Activities

Data collection for the Social Media Archive is largely based on text queries, augmented with subscriptions to selected government-run or community-run accounts of public interest.

The queries and subscriptions are curated by SLNSW staff.

The search queries and subscriptions are organised into Activities, each representing different aspects of life in New South Wales.

Activity              Description
ARTS                  Posts related to the arts.
BUSINESS              Business in New South Wales.
EDUCATION             New South Wales education system.
ENVIRONMENT           New South Wales environmental discussion.
GOVERNMENT            Government in New South Wales.
INDIGENOUS            Posts related to Indigenous issues and culture.
LEISURE               Leisure activity in New South Wales.
MEDIA                 New South Wales media.
POLITICS              New South Wales politics.
SPORT AND RECREATION  Sporting clubs and events.

What do the Mediatypes categories refer to?

After collection, Social Media content is classified by mediatype: a broad category describing the type of media the content is.

The current mediatype categories are:

Mediatype     Description
microblog     Posts on microblogging websites such as Twitter and Google+.
public pages  Public pages, comments and other content.
picture       Instagram and other picture sharing platforms.
news          News media websites.
video         YouTube and other video sharing websites.
blog          Blogspot, WordPress and other blogging websites.
gov           Government websites.
forum         Posts and threads on Internet discussion forums.
question      Yahoo Answers and other question answering websites.
comment       Website comments.
misc          Content that we could not determine a type for.

How does showing content on the map of New South Wales work?

Only a very small amount of Social Media content includes unambiguous geolocation information.

On Twitter and Instagram, users can elect to include their latitude and longitude when tweeting or uploading an image respectively.

Using Local Government Area boundary information from Data.gov.au, we can determine which Local Government Area a given latitude and longitude falls within.
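
The lookup from a coordinate to a Local Government Area can be illustrated with a standard point-in-polygon test. The sketch below is not the Archive's implementation: it assumes each boundary is a single simple polygon given as (longitude, latitude) pairs, and the two rectangular "example" areas and the names point_in_polygon and find_lga are made up for illustration. Real boundary data would have many more vertices and possibly multiple rings per area.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count the edges crossed by a ray running east from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > y) != (y2 > y):
            # Longitude at which this edge crosses the point's latitude.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_lga(lon: float, lat: float,
             boundaries: Dict[str, List[Point]]) -> Optional[str]:
    """Return the first area whose polygon contains the point, or None."""
    for name, polygon in boundaries.items():
        if point_in_polygon((lon, lat), polygon):
            return name
    return None


# Made-up rectangular boundaries, for illustration only.
EXAMPLE_BOUNDARIES: Dict[str, List[Point]] = {
    "Example LGA A": [(151.0, -34.0), (151.2, -34.0), (151.2, -33.8), (151.0, -33.8)],
    "Example LGA B": [(151.2, -34.0), (151.4, -34.0), (151.4, -33.8), (151.2, -33.8)],
}

print(find_lga(151.1, -33.9, EXAMPLE_BOUNDARIES))
```

A point that falls outside every polygon returns None, which corresponds to content that cannot be placed on the map.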

How do you detect emotions within Social Media Content?

The Emotion Detection system used in the archive was originally developed for the We Feel project. It uses a large vocabulary of emotion terms compiled from multiple sources, including the ANEW and LIWC corpora and a list of moods from LiveJournal. We conducted a crowdsourcing task (using CrowdFlower) to organise these terms against Parrott's hierarchy of emotions.
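
As a rough illustration of vocabulary-based emotion detection, the sketch below looks up each word of a post in a small term-to-emotion table and counts the matches. The eight-word TOY_LEXICON and the function name detect_emotions are invented for this example; the real vocabulary is far larger, and the actual matching rules are not described here. The emotion labels are the primary emotions in Parrott's hierarchy.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical miniature lexicon: emotion term -> Parrott primary emotion.
TOY_LEXICON: Dict[str, str] = {
    "happy": "joy",
    "delighted": "joy",
    "furious": "anger",
    "annoyed": "anger",
    "scared": "fear",
    "miserable": "sadness",
    "adore": "love",
    "astonished": "surprise",
}


def detect_emotions(text: str, lexicon: Dict[str, str] = TOY_LEXICON) -> Counter:
    """Count how many words in the text map to each emotion in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(lexicon[word] for word in words if word in lexicon)


print(detect_emotions("So happy and delighted, but a bit scared"))
```

Counts like these can then be summed over many posts to give the aggregate emotion trends the archive reports.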

How do you extract keywords from Social Media Content?

Keywords are selected for each item of Social Media Content based on the frequency of words across the Archive. Hashtags are filtered out of the keyword selection process so they do not swamp the keyword results.
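
The exact scoring is not described above. One common way to use archive-wide word frequency is TF-IDF, where words that are frequent in an item but rare across the collection score highest. The sketch below assumes that interpretation; the function names, the tokenisation and the three-post example corpus are all illustrative rather than taken from the Archive.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Remove hashtags, then return the lowercase alphabetic words."""
    without_hashtags = re.sub(r"#\w+", " ", text)
    return re.findall(r"[a-z]+", without_hashtags.lower())


def document_frequencies(corpus: List[str]) -> Counter:
    """For each word, count the number of items it appears in."""
    df: Counter = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    return df


def top_keywords(text: str, df: Counter, n_docs: int, k: int = 3) -> List[str]:
    """Rank the item's words by term frequency times smoothed inverse document frequency."""
    tf = Counter(tokenize(text))
    scores = {
        word: count * math.log((1 + n_docs) / (1 + df[word]))
        for word, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda word: (-scores[word], word))[:k]


corpus = [
    "the ferry to manly was late #sydney",
    "the train was late again",
    "the ferry is the best way to travel",
]
df = document_frequencies(corpus)
print(top_keywords(corpus[0], df, len(corpus), k=1))
```

In this toy corpus "the" appears in every item and scores zero, while "manly" appears only once and ranks first for the first post.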

How do you extract hashtags from the Social Media Content?

Hashtags are much simpler to extract than Emotions or Keywords. To extract hashtags, the system simply looks for words beginning with a number sign or hash character (#). As noted in the Keyword discussion, hashtags are filtered out of the Social Media content text when selecting keywords.
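
A minimal sketch of this kind of extraction, using a regular expression. The pattern and the function names are assumptions made for illustration; in particular, requiring that the # not follow a word character (so that "a#b" is not treated as a hashtag) is a choice made here, not something the Archive documents.

```python
import re
from typing import List

# A '#' that does not follow a word character, then one or more word characters.
HASHTAG_PATTERN = re.compile(r"(?<!\w)#(\w+)")


def extract_hashtags(text: str) -> List[str]:
    """Return the hashtags in the text, without the leading '#'."""
    return HASHTAG_PATTERN.findall(text)


def strip_hashtags(text: str) -> str:
    """Return the text with hashtags removed, ready for keyword selection."""
    return " ".join(HASHTAG_PATTERN.sub(" ", text).split())


post = "Sunset at #Bondi with friends #NSW"
print(extract_hashtags(post))
print(strip_hashtags(post))
```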

Are there privacy concerns with the Social Media Archive?

The SLNSW Social Media Archive is designed to only show aggregate data. It allows users to view trends in discussion in Social Media but does not support viewing content at the level of individual posts.

Aggregate figures are produced only when the number of Social Media items matching a set of conditions is above a threshold. If the number of matching items is below this threshold, no aggregate counts are provided.
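
The thresholding rule can be sketched as follows. The threshold value is not stated above, so MIN_ITEMS = 10 is a placeholder, and treating a result set exactly at the threshold as reportable is also an assumption. The function name aggregate_counts is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

MIN_ITEMS = 10  # placeholder value; the Archive's real threshold is not stated


def aggregate_counts(labels: Iterable[str],
                     min_items: int = MIN_ITEMS) -> Optional[Dict[str, int]]:
    """Return per-label counts, or None when too few items match to report safely."""
    matched = list(labels)
    if len(matched) < min_items:
        # Too few matching items: suppress the result entirely.
        return None
    return dict(Counter(matched))


print(aggregate_counts(["joy"] * 3))
print(aggregate_counts(["joy"] * 7 + ["fear"] * 5))
```

Suppressing the whole result, rather than returning small counts, avoids exposing figures that could be traced back to a handful of individual posts.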

API

The aggregate information displayed on the Social Media Archive website is also available to researchers via an API.

To access the API, you need to register for an account. To do this, you will need to provide an email address that we can associate with your API requests.

After registering, you will be provided with an access token. This token enables you to log in to the API.

Your account is limited to 100 requests every 15 minutes. This enables us to provide good, consistent service to all our visitors and API users. If you exceed the request quota, your account will be rate limited.
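
API clients may find it useful to track their own request budget so they stay within the quota. The sketch below is a client-side helper only, with a hypothetical class name (RequestBudget). It assumes the quota behaves like a sliding 15-minute window, which may not match how the server actually counts requests, and it makes no assumptions about the API's endpoints.

```python
import time
from collections import deque
from typing import Callable, Deque


class RequestBudget:
    """Track recent request times and report how long to wait before the next one."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent: Deque[float] = deque()

    def seconds_until_allowed(self) -> float:
        """Return 0.0 if a request may be sent now, else the wait time in seconds."""
        now = self.clock()
        # Forget requests that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record_request(self) -> None:
        """Call this each time a request is actually sent."""
        self._sent.append(self.clock())


budget = RequestBudget()
wait = budget.seconds_until_allowed()
if wait > 0:
    time.sleep(wait)
budget.record_request()  # then send the API request with your access token
```

The clock parameter exists so the helper can be tested with a fake clock instead of waiting in real time.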

Technical Information

Architecture

Content collection and processing initially happens on dedicated Vizie instances.

Vizie is our Social Media monitoring platform. It allows its users to specify queries to search for and public Social Media accounts to subscribe to, and then monitors Social Media for relevant content.

Content collected in Vizie is piped through to the Social Media Archive using Kafka.

The search and filtering capabilities of the Explorer interface are provided by Elasticsearch.

How was the web interface constructed?

The SLNSW Social Media Archive website was developed using a variety of tools including:

Visualisations were created using the following tools:

API documentation was generated using:

NSW Local Government Area Administrative Boundaries came from Data.gov.au