-
My Journey Into Scraping Twitter and/or Reddit Data
As part of my digital humanities project, I am considering collecting and analysing tweets and/or reddit discussions on the #INeedMasculism #INeedMasculinism #INeedMasculismBecause hashtag(s). I am hoping to get some proof of concept so that I can employ this method or something similar in my research for my MRES next year. As this is a pretty new idea for me, I’m going to be documenting my journey here as a sort of resource journal.
What Data to Collect
Data I am likely to want in addition to the actual tweet content is username, time/date, retweets and mentions. I will probably also want to collect some data about each user - profile picture, age, gender and other details available on profiles. I am likely to want similar data for Reddit. Additionally, for Reddit I will also want to collect score, up/down votes and subreddit.
This data will then go into a spreadsheet - ideally a multidimensional db, but that might have to wait for another time. In the spreadsheet I will have a page of each submission thread and then an individual page for each submission id that aggregates all the comments. I’m going to have to think of some way to filter out quoted comments to avoid double up. I will also need a way to track which comment each user is replying to - although, this might be problematic in itself as users might be replying to multiple comments. Nonetheless, it seems on Reddit that most users are fairly consistent in posting methods.
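One way the quoted-comment filter could work is sketched below. It assumes quoted text arrives as Markdown blockquote lines that start with `>` (or the escaped form `&gt;`), which is how Reddit comments usually mark quotes; `strip_quoted_lines` is a hypothetical helper name, not part of PRAW.

```python
def strip_quoted_lines(body):
    """Return a comment body with Markdown blockquote lines removed."""
    kept = []
    for line in body.splitlines():
        stripped = line.lstrip()
        # Assumption: quotes are lines starting with '>' or the escaped '&gt;'.
        if stripped.startswith(('>', '&gt;')):
            continue
        kept.append(line)
    # Trim the blank lines left at the start and end after removing quotes.
    return '\n'.join(kept).strip()


sample = (
    "> men have issues too\n"
    "\n"
    "I agree with this.\n"
    "&gt; another quote\n"
    "But not with this."
)
print(strip_quoted_lines(sample))
```

This only drops whole quoted lines; text that is pasted from another comment without quote markers would still be counted twice.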
A quick search of reddit turns up very little for both #INeedMasculinism and #INeedMasculism. However, there are quite a few threads/posts discussing #INeedMasculismBecause. INeedMasculism will be a good tag to do testing with due to the limited results.
Reddit Scraping
When I started looking into scraping Reddit data, I saw in this post a suggestion that the simplest way is to use PRAW. PRAW is a python module that interfaces with the Reddit API. The great thing about PRAW is that it is designed to stay within the guidelines of the Reddit API, so all I need to do is supply a useragent string in line with the guidelines and after that I don’t have to worry about any issues with authentication or getting booted from the API for abuse.
About PRAW: http://praw.readthedocs.io/en/stable/
It doesn’t seem too difficult. I think that by getting my hands dirty, even with my very limited knowledge of Python, I might actually be able to get somewhere with the Reddit API and PRAW.
This was my first test run - getting the top 5 hot threads on the opensource subreddit.
    import praw

    r = praw.Reddit(user_agent='praw testing')
    submissions = r.get_subreddit('opensource').get_hot(limit=5)
    for submission in submissions:
        print str(submission)
And the output:
    49 :: Memes, etc.
    32 :: 59 percent of tech hiring managers say they'll increase their open sour...
    8 :: Why is Open Source built with closed tools?
    9 :: Meet Kermit: A friendly web scraper written in coffeescript. Fun to use ...
    14 :: Open source near ubiquitous in IoT, report finds
Success!
Searching for Comments on a Topic
submissions = r.search('INeedMasculism', subreddit=None, sort=None, syntax=None, period=None)
This gives a list of all threads matching the search term. However, I now need the submission id from each in order to parse the comments. We get that with
submission.id
For example:
    import praw

    r = praw.Reddit(user_agent='praw testing')
    submissions = r.search('INeedMasculism', subreddit=None, sort=None, syntax=None, period=None)
    for submission in submissions:
        print str(submission.id)
Actually, on further investigation I find that I can just get it with
submission.comments
Now I have the comments objects. If I iterate along the comments objects with a for loop I can get the various variables available for each comment with something along the lines of
print(vars(comment))
I’m running it on INeedMasculism rather than any of the other variations as this only turns up one thread. For final run-throughs I will run on all hashtags/terms by placing them in a tuple and iterating through them with a for loop.
This is what I have so far:
    import praw

    r = praw.Reddit(user_agent='praw testing')
    submissions = r.search('INeedMasculism', subreddit=None, sort=None, syntax=None, period=None)
    for submission in submissions:
        print 'Thread: ' + str(submission.id) + str(submission)
        for comment in submission.comments:
            print ''
            print 'Author: ' + str(comment.author)
            print 'Likes: ' + str(comment.likes)
            print 'Score: ' + str(comment.score) + ' (Ups: ' + str(comment.ups) + ' / Downs: ' + str(comment.downs) + ')'
            print 'Comment: ' + str(comment.body)
            # print(vars(comment))
Basically what I’m doing line by line.
- I import the PRAW python module
- I set my user agent string to something descriptive (when I flesh out the script this will change as Reddit asks nicely that we be descriptive with what we are doing with the API)
- Search for submissions that use my search term
- Iterate a for loop over each submission
- Print the name of the thread and its id to the terminal
- Iterate a for loop over each comment in that particular submission
- Print to terminal the author, the likes field, the score including up and down votes, and the comment body
- Finally I have commented out the way that I can get the various vars available so that I can search for other info that I might want to include
This yields the following results:
    Thread: 26plsq6 :: Anyone remember #INeedMasculism?

    Author: cjt09
    Likes: None
    Score: 7 (Ups: 7 / Downs: 0)
    Comment: Something that a lot of people don't realize is that there *is* currently a lack of spaces in societies where people can discuss men's issues. What's worse, is when people do try to create these spaces, they're stigmatized and--dare I say it--oppressed. People should be able to express themselves, and [NeuroticIntrovert has a great post that goes into much greater detail here](http://np.reddit.com/r/changemyview/comments/1jt1u5/cmv_i_think_that_mens_rights_issues_are_the/cbi2m7a), where he steps through how this became such an issue today. We should really be promoting an environment that allows *everyone* to express their own views and perspectives. #YesAllWomen is a great movement that carries enough social capital that a lot of women can express the frustrations and injustices that they experience every day. #YesAllMen is important for the same reason--we can't work together on a solution unless we know where each other is coming from.

    Author: snwbrdbum14
    Likes: None
    Score: 7 (Ups: 7 / Downs: 0)
    Comment: You can't put something in perspective for someone who lacks perspective altogether
Next week I will be looking at ways that I can insert this info into a database. Feeling on track! Hitting those deliverables!
Putting it in a Database (i.e. Excel <rolleyes>)
The simplest method I found for creating a spreadsheet for the data was to use the xlwt Python module:
sudo pip install xlwt
Using Excel spreadsheets is probably not the best method, but it is certainly the simplest for now. I’m not doing anything advanced (formulas, formatting, links); I just need something that I can dump the data into. A CSV file didn’t seem like it would suffice, as many of the comments will likely contain characters such as commas and I’m not really interested in going to the effort of stripping them out. I’m also not exactly sure yet which methods I will be using to analyse the data, so I just want to make sure I’m collecting the data in the rawest form possible. I can use LibreOffice to convert to a different format if necessary.
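For future reference, commas on their own need not rule out CSV: Python's standard csv module quotes fields, so commas, quote marks and line breaks inside a comment should come back unchanged. The sketch below uses made-up rows (the id, author and comment are placeholders, not scraped data) and writes to an in-memory buffer rather than a file.

```python
import csv
import io

# Placeholder rows: a header plus one comment containing awkward characters.
rows = [
    ("ID", "AUTHOR", "COMMENT"),
    ("abc123", "example_user", 'Commas, "quotes" and\nline breaks survive'),
]

# newline='' lets the csv module control line endings itself.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

# Read the same text back and compare it with what went in.
buffer.seek(0)
round_tripped = [tuple(row) for row in csv.reader(buffer)]
print(round_tripped == rows)
```

Whether the tool that later reads the file handles quoted line breaks correctly is a separate question, so the spreadsheet route is still a reasonable choice.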
I used the basic examples here: http://www.saltycrane.com/blog/2010/02/using-python-write-excel-openoffice-calc-spreadsheet-ubuntu-linux/
    import xlwt

    DATA = (("The Essential Calvin and Hobbes", 1988,),
            ("The Authoritative Calvin and Hobbes", 1990,),
            ("The Indispensable Calvin and Hobbes", 1992,),
            ("Attack of the Deranged Mutant Killer Monster Snow Goons", 1992,),
            ("The Days Are Just Packed", 1993,),
            ("Homicidal Psycho Jungle Cat", 1994,),
            ("There's Treasure Everywhere", 1996,),
            ("It's a Magical World", 1996,),)

    wb = xlwt.Workbook()
    ws = wb.add_sheet("My Sheet")
    for i, row in enumerate(DATA):
        for j, col in enumerate(row):
            ws.write(i, j, col)
    ws.col(0).width = 256 * max([len(row[0]) for row in DATA])
    wb.save("myworkbook.xls")
Fairly straightforward: each row of data goes in a tuple in the order of the columns, then each row tuple goes inside a larger tuple. Then you just iterate over the tuple with a for loop and write the data with ws.write into the coordinates. Enumerate just lets you have an additional variable that holds the position within the for loop as an integer - i is holding the y coordinate (rows) and j is the x coordinate (columns).
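To make the coordinates concrete, here is a tiny sketch that records the (row, column, value) triples that nested enumerate loops produce, without touching xlwt at all. The two rows are placeholder data.

```python
# Placeholder data: two rows of (title, year).
DATA = (("Title A", 1988), ("Title B", 1990))

cells = []
for i, row in enumerate(DATA):        # i is the row index (y coordinate)
    for j, value in enumerate(row):   # j is the column index (x coordinate)
        cells.append((i, j, value))

print(cells)
```

Each triple is what would be passed to ws.write(i, j, value) in the spreadsheet version.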
The ws.col(0).width is an attempt to autosize the width of columns, but I’m not really interested in doing that so I scrapped it as it was causing problems when there were empty datasets.
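The empty-dataset problem most likely comes from calling max() on an empty list, which raises a ValueError. If column sizing is ever wanted again, a guarded version might look like the sketch below. `column_width` and `minimum_chars` are hypothetical names; the factor of 256 per character is taken from the example above.

```python
def column_width(rows, column, chars_per_unit=256, minimum_chars=8):
    """Return a width for one column that also works when rows is empty."""
    # Skip rows that are too short to have this column.
    lengths = (len(str(row[column])) for row in rows if len(row) > column)
    # default=0 avoids the ValueError that max() raises on an empty sequence.
    widest = max(lengths, default=0)
    return chars_per_unit * max(widest, minimum_chars)


print(column_width([], 0))
print(column_width([("abcdefghij", 1)], 0))
```

The result could then be assigned to ws.col(0).width in place of the original expression.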
So what I did was just use for loops and add the submissions to a tuple.
    for searchterm in SEARCH_TERMS:
        print 'Searching for term: ' + str(searchterm)
        submissions = r.search(searchterm, subreddit=None, sort=None, syntax=None, period=None)
        for submission in submissions:
            # Make sure the submission isn't a bot copying another submission
            title = str(submission.title)
            if title.startswith('[COPY]'):
                print 'Submission COPY - Skipping...'
            else:
                submission_db = ()
                # Make a separate db page for submission details only
                try:
                    submission_db = submission_db + (str(submission.id),)
                    submission_db = submission_db + (str(submission.title),)
                    submission_db = submission_db + (str(submission.author),)
                    submission_db = submission_db + (str(submission.url),)
                    submission_db = submission_db + (str(datetime.datetime.fromtimestamp(int(submission.created_utc)).strftime('%Y-%m-%d %H:%M:%S')),)
                    submission_db = submission_db + (str(submission.subreddit),)
                    submission_db = submission_db + (str(submission.subreddit_id),)
                    submission_db = submission_db + (str(submission.score),)
                    submission_db = submission_db + (str(submission.selftext),)
                    submissions_db = submissions_db + (submission_db,)
                    print 'Added Submission ID: ' + str(submission.id)
                except:
                    print 'ERROR: Adding info to submission_db'

    if 'submissions' in str(sys.argv) or len(sys.argv) == 1:
        print 'WRITING SUBMISSIONS'
        submissions_ws = wb.add_sheet("Submissions")
        for i, row in enumerate(submissions_db):
            for j, col in enumerate(row):
                submissions_ws.write(i, j, col)

    wb.save("myworkbook.xls")
I added some categories at the top of the spreadsheet in the first row before I run the search for submissions.
    SUBMISSIONS_DB_CAT = ('ID', 'TITLE', 'AUTHOR', 'URL', 'TIMESTAMP', 'SUBREDDIT', 'SUBREDDIT ID', 'SCORE', 'COMMENT')
    submissions_db = ()
    submissions_db = submissions_db + (SUBMISSIONS_DB_CAT,)
Debugging
Initially everything was working fine. I had expanded my code out to also collect comments for each submission and place them in individual pages of the spreadsheet according to submission ID. The code also aggregated all the comments for all submissions into a separate page. I also noticed that the submission itself acted as the first comment, so I needed to add those details in the first row of each comments page. Now that everything seemed to be working fine, I decided to expand my code out to be able to collect multiple search terms. This was when I started running into ASCII UnicodeErrors. Googling the issue, I found I needed to set encoding/decoding to utf-8. However, all the solutions I found would have required me to do this for every individual string and I really couldn’t be bothered. There must be a quicker way! After some lengthy googling I stumbled upon this solution in a stackoverflow comment thread:
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

    wb = xlwt.Workbook(encoding='utf-8')
It worked, but xlwt started throwing out errors. I found I also needed to set the encoding for the Workbook and all was well again!
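The reload(sys) trick only exists in Python 2 and changes behaviour for the whole interpreter, so it may be worth having a fallback. One alternative is a single helper that every value passes through before it is stored, which avoids sprinkling encode/decode calls around. The sketch below is written for Python 3, and `to_text` is a hypothetical helper name.

```python
def to_text(value, encoding='utf-8'):
    """Convert any scraped value to text without raising encoding errors."""
    if value is None:
        return ''
    if isinstance(value, bytes):
        # Undecodable bytes become the replacement character instead of failing.
        return value.decode(encoding, errors='replace')
    return str(value)


print(to_text(b'caf\xc3\xa9'))
print(to_text(7))
```

Each str(...) call in the scraping loop could then be swapped for to_text(...).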
After scraping some data, I noticed a few duplicate submissions in the POLITICS subreddit. It seems that a few bots were making copies of threads and dumping them into the subreddit. Copy threads are marked with [COPY] in the title, so I wrote a simple if statement routine to check submissions for this string in the title and skip them.
As I was going through the data, I started noticing something strange: all the up vote scores were equal to the total score. Moreover, many of the up votes were negative which simply cannot be possible. I then noticed the downvotes were all zero. After some searching in the Redditdev subreddit, I discovered that Reddit had recently made a (contentious) decision to remove all access to up/down vote scores and also upvote ratio data. This had been introduced to stop bots from attempting to artificially inflate/deflate some scores. Without access to this through the API, bots are unable to check whether or not they are having an effect on scores. Disappointing for me as a zero score with 100 votes is very different to a zero score with no votes in terms of impact on other users. Still, if the scores are being altered by bots, then they aren’t really going to be a very good heuristic anyway!
Another thing I noticed from putting the data into a spreadsheet was that, when I compared back to the original threads, I seemed to be missing a number of comments. I confirmed this by adding a comment counter and comparing the number of comments in each thread with what I was collecting. I then realised I was only getting the first level of comments, not the comments that replied to other comments. Each comment object also has a replies object! I solved this by writing a function to grab the comments data, check for a replies object, and recursively run that function on the replies object - so essentially the script walks down the comments tree. The other option was to use a flatten comments option in PRAW. However, I wanted to make sure I retained the order of comments. The other issue I came up against was that some comment replies are hidden behind a “More Comments” object. Currently I’m just checking for this and skipping them as I haven’t come up against many, but this will ultimately need a solution. From what I’ve read, there are some limitations on walking down the more comments objects - we will see!
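The recursive walk can be sketched without PRAW by using stand-in classes. `FakeComment` and `FakeMore` below are hypothetical placeholders for PRAW's comment and "more comments" objects; the only thing assumed about the real objects is that a comment exposes an id and a list of replies.

```python
class FakeComment:
    """Stand-in for a comment: an id plus a list of replies."""
    def __init__(self, comment_id, replies=()):
        self.id = comment_id
        self.replies = list(replies)


class FakeMore:
    """Stand-in for a collapsed 'more comments' placeholder."""


def walk_comments(comments, depth=0):
    """Yield (depth, id) for every comment, parents before their replies."""
    for comment in comments:
        if not isinstance(comment, FakeComment):
            continue  # skip 'more comments' placeholders for now
        yield depth, comment.id
        # Recurse into the replies so nested comments keep their thread order.
        yield from walk_comments(comment.replies, depth + 1)


tree = [
    FakeComment('a', [FakeComment('b', [FakeComment('c')]), FakeMore()]),
    FakeComment('d'),
]
order = list(walk_comments(tree))
print(order)
```

Recording the depth alongside each id keeps the thread structure recoverable even after the comments are stored as flat rows.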
PostgreSQL
Now I’m on to adding to a real database rather than a spreadsheet. I rewrote my code a little bit to use dictionaries rather than tuples.
    def addCommentRegression(submission, comment):
        global submission_db_dict
        global comment_db_dict
        # Make sure it is a comment rather than morecomments object
        if isinstance(comment, praw.objects.Comment):
            try:
                comment_dict = {}
                comment_dict['SubmissionID'] = submission.id
                comment_dict['ParentID'] = comment.parent_id
                comment_dict['Author'] = comment.author
                comment_dict['Created'] = datetime.datetime.fromtimestamp(int(comment.created_utc)).strftime('%Y-%m-%d %H:%M:%S')
                comment_dict['Score'] = comment.score
                comment_dict['Removal_Reason'] = comment.removal_reason
                comment_dict['Report_Reasons'] = comment.report_reasons
                comment_dict['Edited'] = comment.edited
                comment_dict['Controversial'] = comment.controversiality
                comment_dict['Body'] = comment.body
                addAuthor(comment.author)
                comment_db_dict[comment.id] = comment_dict
                print 'Added comment ID: ' + str(comment.id)
            except Exception as e:
                print(e)
            # Regression for comments
            if comment.replies:
                for reply in comment.replies:
                    addCommentRegression(submission, reply)
        else:
            print 'More Comments OBJECT'
I’m using the tutorial here: http://zetcode.com/db/postgresqlpythontutorial/
Essentially I am collecting a submissions table, a comments table and an authors table. I will also need to add a subreddit table, after I look into collecting data on subreddits
    con = None
    try:
        con = psycopg2.connect(database=POSTGRES_DB, user=POSTGRES_USER)
        cur = con.cursor()

        # ADD SUBMISSIONS
        table = ()
        for key, value in submission_db_dict.iteritems():
            entry = (str(key),
                     str(value['Author']),
                     str(value['Created']),
                     int(value['Score']),
                     str(value['Selftext']),
                     str(value['SubredditID']),
                     str(value['Title']))
            table = table + (entry,)

        cur.execute("DROP TABLE IF EXISTS Submissions")
        cur.execute("CREATE TABLE Submissions(SubmissionID TEXT PRIMARY KEY, Author TEXT, Created TEXT, Score INT, Selftext TEXT, SubredditID TEXT, Title TEXT)")
        query = "INSERT INTO Submissions (SubmissionID, Author, Created, Score, Selftext, SubredditID, Title) VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cur.executemany(query, table)
        con.commit()
    except psycopg2.DatabaseError, e:
        if con:
            con.rollback()
        print 'Error %s' % e
        sys.exit(1)
    finally:
        if con:
            con.close()
It seems to be all working correctly as far as I can tell.
Debugging Foreign Keys and Author 404s
I encountered a database error when referencing parent_ids back to the comments table. I quickly realised that some parent_ids referenced the submissions rather than comments. Basically they were split over two tables. The way forward (after discussing with Brian - one of my DH teachers) was to use a posts supertype table that holds all the common data, and then have the primary key of both the comments and submissions tables also be a foreign key that references this posts table. After doing this I quickly discovered that the parent id keys carry some additional information, so they wouldn’t properly match the ids. I solved this by adding an additional variable to my addComments function that passes through the previous comment/submission id from the last position in the tree.
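The supertype layout can be tried out in miniature with Python's built-in sqlite3 module, used here purely as a stand-in for PostgreSQL so the sketch runs anywhere. The table and column names are illustrative rather than my actual schema. The extra information on parent ids is assumed to be Reddit's type prefix (t1_ for comments, t3_ for submissions), and `strip_fullname_prefix` is a hypothetical helper offered as an alternative to passing the previous id down the tree.

```python
import sqlite3


def strip_fullname_prefix(fullname):
    """Turn 't1_abc' or 't3_abc' into 'abc'; leave bare ids untouched."""
    prefix, sep, rest = fullname.partition('_')
    if sep and prefix in ('t1', 't3'):
        return rest
    return fullname


con = sqlite3.connect(':memory:')
con.execute('PRAGMA foreign_keys = ON')  # SQLite leaves enforcement off by default
con.executescript("""
CREATE TABLE Posts(PostID TEXT PRIMARY KEY, Author TEXT, Score INTEGER);
CREATE TABLE Submissions(
    PostID TEXT PRIMARY KEY REFERENCES Posts(PostID),
    Title TEXT);
CREATE TABLE Comments(
    PostID TEXT PRIMARY KEY REFERENCES Posts(PostID),
    ParentID TEXT NOT NULL REFERENCES Posts(PostID),
    Body TEXT);
""")

# A submission, a top-level comment and a nested reply (placeholder data).
con.execute("INSERT INTO Posts VALUES (?, ?, ?)", ('sub1', 'alice', 6))
con.execute("INSERT INTO Submissions VALUES (?, ?)", ('sub1', 'Example thread'))
con.execute("INSERT INTO Posts VALUES (?, ?, ?)", ('com1', 'bob', 7))
con.execute("INSERT INTO Comments VALUES (?, ?, ?)",
            ('com1', strip_fullname_prefix('t3_sub1'), 'Top-level reply'))
con.execute("INSERT INTO Posts VALUES (?, ?, ?)", ('com2', 'carol', 2))
con.execute("INSERT INTO Comments VALUES (?, ?, ?)",
            ('com2', strip_fullname_prefix('t1_com1'), 'Nested reply'))

# Leaving the prefix on reproduces the foreign key mismatch.
try:
    con.execute("INSERT INTO Posts VALUES (?, ?, ?)", ('com3', 'dave', 1))
    con.execute("INSERT INTO Comments VALUES (?, ?, ?)",
                ('com3', 't1_com1', 'Prefix left on'))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

comment_count = con.execute("SELECT COUNT(*) FROM Comments").fetchone()[0]
print(orphan_rejected, comment_count)
```

Because ParentID points at Posts rather than at Comments or Submissions, a reply to a submission and a reply to a comment satisfy the same constraint.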
I also had a problem with some authors throwing out HTTP errors - through debugging I discovered which authors were throwing these errors. I then looked up these authors’ API URLs and noticed that they returned 404 errors. Through some deft Google searching I found discussion noting that these occur when users are shadow banned. I had to add some special error handling for these cases.
Initially, I was collecting authors whenever they came up for a post. In order to avoid collecting author data multiple times I would check if the author object already existed in the tuple and then add it if it didn’t. Then later in my code I can iterate over the tuple for each author and add the data to my author dictionary - this way I would only be looking up the author data once per author rather than overwriting it.
HOWEVER, because of how PRAW handles data look-ups, it will pass an author object but doesn’t actually look it up until it is referenced in code - for example, when I’m checking if it’s in the tuple. At that point I get the 404 error, so the problem is that the author doesn’t get added to my author table, causing foreign key problems. I don’t just want to assign these as None authors as they do exist. So what I did was check if the author exists in the tuple and add it if not. If an exception is thrown then I know that the author is either a None author or a 404 (or some other error). So then I check if the str() of the author object exists in the tuple and, if it doesn’t, add it. If that fails then I know that there is no author name. Later, when doing my author look-ups, I first try looking up the author object - if that fails then I know it is either a look-up error or a None type author. So then I try to add it to the dictionary as a string - a None type object will fail as it is not a string. So finally at the end I also make sure to add a None type object to the author table. For the authors that have a 404 error, I also record this fact as it is the only data other than the name that I have about them. This may actually prove useful as I can compare the comments of authors who have been banned with those who have not.
The code to deal with authors:
    def addAuthor(author):
        global author_collection
        try:
            if author not in author_collection:
                author_collection = author_collection + (author,)
                print 'Added Author: ' + str(author)
        except Exception as e:
            print str(e) + ' for author ' + str(author) + ' adding as string instead of object'
            try:
                if str(author) not in author_collection:
                    author_collection = author_collection + (str(author),)
                    print 'Added Shadow_Banned Author as String: ' + str(author)
            except Exception as e:
                print str(e) + ' Failed to add author as string instead of object'

    for author in author_collection:
        print 'Collecting Author' + str(author)
        try:
            if author.name:
                author_dict = {}
                author_dict['Created'] = datetime.datetime.fromtimestamp(int(author.created_utc)).strftime('%Y-%m-%d')
                author_dict['Comment_Karma'] = author.comment_karma
                author_dict['Link_Karma'] = author.link_karma
                author_dict['Is_Mod'] = author.is_mod
                author_dict['Is_404'] = 'False'
                author_db_dict[str(author.name)] = author_dict
        except Exception as e:
            print(e)
            try:
                author_db_dict[str(author)] = {'Created': '', 'Comment_Karma': '0', 'Link_Karma': '0', 'Is_Mod': '', 'Is_404': 'True'}
            except Exception as e:
                print(e)

    author_db_dict['None'] = {'Created': '', 'Comment_Karma': '0', 'Link_Karma': '0', 'Is_Mod': '', 'Is_404': 'False'}
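A possible simplification is to key everything on the author's name, since str() on the author object worked even for the 404 authors, and only attempt the lazy look-up once per name. The sketch below uses a stand-in class, `LazyAuthor`, to imitate an object that fails on attribute access; it is hypothetical and not PRAW's real behaviour in detail.

```python
class LazyAuthor:
    """Stand-in for a lazily loaded author; banned ones fail on look-up."""
    def __init__(self, name, banned=False):
        self._name = name
        self._banned = banned

    def __str__(self):
        return self._name  # assumed not to trigger a look-up

    @property
    def comment_karma(self):
        if self._banned:
            raise LookupError('404 for ' + self._name)
        return 42


def index_authors(authors):
    """Map each distinct author name to the first object seen for it."""
    by_name = {}
    for author in authors:
        name = 'None' if author is None else str(author)
        by_name.setdefault(name, author)
    return by_name


def build_author_row(author):
    """Look the author up once, recording a 404 instead of failing."""
    if author is None:
        return {'Comment_Karma': 0, 'Is_404': False}
    try:
        return {'Comment_Karma': author.comment_karma, 'Is_404': False}
    except LookupError:
        return {'Comment_Karma': 0, 'Is_404': True}


authors = [LazyAuthor('alice'), LazyAuthor('ghost', banned=True), None, LazyAuthor('alice')]
rows = {name: build_author_row(a) for name, a in index_authors(authors).items()}
print(rows)
```

Deduplicating on the name string means the membership check itself never touches the network, so a 404 can only surface inside build_author_row where it is handled.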
An Aside on Twitter Scraping for Future Reference
At first I had been struggling to find information on how I would go about collecting Twitter data. I found some information about topsy.com, a service run by Apple that collected tweets and allowed end users to search that data in a variety of ways. Unfortunately, the service has now been shut down. Nonetheless, pressing on further I was able to find some information. I’ve been collecting the links I’ve found in the section at the bottom of this post.
According to the article on Knightlab, the Twitter API imposes some limits on how many calls can be made within a certain window. The Knightlab article suggests a few good tips about setting up a cron job to fetch data every 15 minutes and cycling between keys to stay within limits. I might need to set up a NeCTAR virtual machine with a cron job to do this for me so that I don’t have to keep a local machine running. From this StackOverflow forum thread, it seems like the Twitter API might not be too difficult to use as it appears to just involve HTTP requests that return JSON. It might actually be fairly easy to just write a quick Python script to collect the data and place it into a database. This would certainly simplify running on a virtual NeCTAR machine. However, if something already exists I would rather use that.
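As a rough sketch of the pacing idea, the arithmetic below works out how far apart calls would need to be to stay inside a window, and shows keys being cycled in turn. The 15-minute window, the limit of 180 calls and the key names are all assumptions for illustration; the real limits would need checking against the Twitter API documentation.

```python
import itertools

WINDOW_SECONDS = 15 * 60  # assumed length of one rate-limit window


def seconds_between_calls(calls_per_window, key_count=1):
    """Spacing that keeps evenly spread calls within the window's allowance."""
    total_calls = calls_per_window * key_count
    return WINDOW_SECONDS / total_calls


def key_cycle(keys):
    """Return an endless iterator that hands out keys in rotation."""
    return itertools.cycle(keys)


print(seconds_between_calls(180))
keys = key_cycle(['KEY_A', 'KEY_B'])
first_three = [next(keys) for _ in range(3)]
print(first_three)
```

A cron job would not need the spacing calculation at all if it makes fewer calls per run than the window allows, which is probably the simpler route.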
A Collection of Links
https://www.researchgate.net/post/What_is_best_way_to_collect_data_from_Twitter
https://www.entrepreneur.com/article/242830
http://stackoverflow.com/questions/2714471/twitter-api-display-all-tweets-with-a-certain-hashtag
https://www.reddit.com/r/TheoryOfReddit/comments/2hg53b/q_how_can_i_collect_raw_data_from_reddit/
https://www.reddit.com/r/redditdev
https://www.reddit.com/dev/api
-
When Worlds Collide:
Love, Intimacy and the Public Sphere
-
Trolling the Public Sphere:
Populist Elites
Internet anonymity provides democratic potential through allowing a greater sense of openness, equality, and freedom to express controversial view points. This allows for an ‘unprecedented’ wider public sphere participation from individuals who otherwise have ‘had little opportunity to participate in public debate’ (Rowland 2006: 519). Anonymity means that views can only be considered in terms of rational content: neither property nor privilege can be used to influence weight (Rowland 2006: 532). However, anonymity also allows for disassociation from individual identity and the creation or appropriation of other identities. It is frequently assumed that anonymity breeds incivility, damaging the Internet’s real potential for effective public sphere discourse. In the recent flurry of moral panic over Internet anonymity, trolling has become the mass media’s catch-all term for any unpleasant, hostile, or offensive comments made anonymously on the Internet. However, these attacks on trolling are perhaps misguided. This paper will delineate and examine some potential implications trolling has for effective public sphere discourse online.
The public sphere acts as a mediating social space ‘between the state and society’ (Delanty 2007: 3721). If private liberty is to be valued, then society must permit a plurality of heterogenous views. However, heterogeneity increases opportunities for marginalization of non-normative views, problematising democracy as ‘collective self-government’ (Peters 2008: 39). Ideally, for government to be truly democratic, all individuals must voluntarily recognise and accept the decisions made (Peters 2008). Peters (2008) suggests that through communications of a certain type, the public sphere can reach this position voluntarily. Habermas (1996: 361-2) states that the public sphere does not enact policy changes, rather, it serves to influence the actions of political institutions which are legitimated through serving the democratic common interest of all. Thus, the basic function of the public sphere is to rationally negotiate and locate the common ground for the ‘practical problems of collective life’ that require or are subject to policy (Peters 2008: 37). Through this expression of solidarity, pressure is exerted on political institutions to legitimate policy by making changes that reflect the common interest.
It is important to note that public sphere discourse does not only exert influence on the formal systems of the state, but also upon the informal aspects of everyday society (Peters 2008). A second-person perspective must be taken in order to attempt to frame propositional attitudes in terms of common interest claims and understand the common interest claims of others (Habermas 1996). This unique perspective creates a relationship to the common ground of ‘both nearness and remoteness simultaneously’ (Simmel 1971/1908: 147). As a result, the varying viewpoints create a Simmelian strangeness which forces exposure and reassessment of latent dogmatic assumptions.
For Peters (2008: 37-8), effective public sphere discourse requires three fundamental features: equality and reciprocity; openness and adequate capacity; and a specific discursive structure. To represent the common interest of all, each agent ‘capable of expressing themselves in public’ must be equally free to do so (Peters 2008: 37). Furthermore, each view must be afforded equal consideration. This does not mean each view has equal value, rather that the value of each view is determined only in accordance with its rational weight. If each agent is to have an equal opportunity to both speak and be heard, then they must also reciprocate this right by listening equally to every other agent. Therefore, all views must be openly accepted and only through public debate can their validity be determined. This assumes that public sphere participants have an adequate capacity to rationally consider and equally debate the common interest of each view. The debate must follow a discursive structure. All views, whether disagreements or proposals, must be expressed as falsifiable arguments, rather than opinions, that demonstrate the link to the common ground between all agents. To preserve rational integrity, logical fallacies, such as ad hominem attacks, cannot be allowed. There must then be a ‘mutual respect’ that this rational integrity will be upheld (Peters 2008: 37-8).
Habermas (1996: 367) states that associations help to ‘distill’ the common interest of their members into a form suitable for effective public sphere discourse. However, it is perhaps a daunting task to try to locate and unite a substantial number of likeminded individuals who are dispersed external to personal milieux (Tocqueville 1999/1840: 92-4). Tocqueville (1999/1840: 92-4) argues that the utility of newspapers is of fundamental importance in achieving this goal. Newspapers give the individual the power to present a thought to a large number of people simultaneously. In short, newspapers allow likeminded individuals to reach out to each other and realise the interconnection of their common interests so as to form associations (Tocqueville 1999/1840: 92-4). However, tendencies towards political parallelism and clientelism in the Australian media negatively affect its political functions both as a marketplace of ideas and as a fourth estate providing objective and unbiased critique on politically important issues (Jones and Pusey 2010: 456-7). Nevertheless, new media forms, such as the Internet, offer associational utility in ways Tocqueville perhaps could not have imagined. Moreover, it is frequently hoped that the Internet can be utilised to overcome many of the deficiencies and shortcomings of traditional news media, and ‘herald new possibilities for political participation’ (Bohman 2004: 131).
The mass media use of the term “trolling” has come to include, arguably incorrectly, direct threats, intimidation, malice, aggression, and deliberately abusive cyber-bullying (for instance, see Hildebrand & Matheson 2012; The Times 2012; Herald Sun 2012). For the purposes here, I wish to draw a distinction between trolling and these other behaviours. This is not to discount the problematic nature of abusive comments that have seemingly become commonplace on much of the Internet. Nor is this distinction to ignore the significant overlap that often occurs between trolling and these other behaviours. Rather, this distinction is to acknowledge the considerably obvious detriment these other behaviours have for rational discourse both online and offline. In logic terms, these can be dismissed as ad hominem attacks that lack any argumentative purpose. What I wish to explore here is whether “proper” trolling can have any relevance or value to rational and communicative public sphere discourse.
The best way to understand the definition of “proper” trolling, is perhaps by comparison with flaming. Flaming can be defined as making a deliberately offensive or inflammatory remark directed towards a person or group, usually as ‘an impulsively angry response to a previous message or a perceived breach of Internet etiquette’, in which satisfaction is derived directly from making the remark (OED Online 2012a). For instance, an offline equivalent is when I yell an insult at a driver who cuts me off in traffic. This is by no means a comprehensive definition. However, it serves the purpose for the distinction I am attempting to make. Trolling, on the other hand, involves making a remark designed to incite a response. For something to be a troll it must involve a bait. This usage originates in the fishing terminology that means ‘to trail a baited line behind a boat’ (OED Online 2012b). Hence, to troll is ‘to post a deliberately erroneous or antagonistic message … with the intention of eliciting a hostile or corrective response’ (OED Online 2012b). In this sense, trolling is closely connected to flaming in that trolling often provokes “flame wars”. It is the motive of inciting a response which differentiates trolling from flaming: ‘a troll who gets no response has failed’ (Hardaker 2010: 232). The question then is what, if any, latent democratic potential exists in this motive.
On the surface, trolling seems to lack near all of the necessary features for public sphere discourse as outlined by Peters (2008). The very purpose of trolling seems to be to derail the discursive structure of rational debate. Trolls, by definition, intentionally use logical fallacies to incite responses. Their arguments are not presented as falsifiable, rather they are false to begin with. In this way, trolls cannot be seen to be extending the mutual respect the discursive structure requires. Additionally, the intentional irrationality seemingly indicates that there is no reciprocity of equal consideration to claims on the basis of rationality. There is then no demonstration of adequate capacity. Moreover, trolling undermines the mutual respect which preserves rational integrity. The very goal of trolling is to escalate arguments into irrational flaming through antagonistic behaviour whilst appearing sincere. Donath (1999: 45) states that ‘the troll attempts to pass as a legitimate participant, sharing the group’s common interests and concerns’. Dahlberg (2001: n.p.) suggests that this ‘identity deception’ negatively affects trust within the group, causing participants to be suspicious of others and wary of putting their own views forward. In this sense, whilst not actively denying participation, trolling can have a negative effect on freedom to express individual views. Nonetheless, trolling cannot be dismissed entirely on face-value.
Papacharissi (2002: 23) states that online interactions are often dominated by only a small percentage of actual users. Furthermore, there is a degree of Internet access stratification whereby certain people could be precluded from online interaction due to a number of reasons such as technical proficiency, age, disability, income and so on (Buchstein 1997; Papacharissi 2002: 21). Ruiz et al. (2011: 475) observe that on many news websites, users rarely comment more than once and so are not really participating in the necessary back and forth that public discourse requires. Moreover, Papacharissi (2002: 17) argues that making a single political comment on a website often gives users a false sense that they have participated in actual public discourse. Trolling can provoke users to engage in further interactions which force them to back up their claims with arguments demonstrating the common interest. Furthermore, as trolls often take counter positions, they then could inadvertently voice otherwise unspoken concerns.
Internet anonymity allows for disassociation from individual identity and the creation or assumption of another (Rowland 2006). Rowland (2006: 533) states that this sense of ‘deindividuation’ can allow individuals to ‘act in a disinhibited fashion’ and ‘exhibit extremes of behaviour’. Massaro and Stryker (2012: 418) point to a study which suggests ‘that selective exposure to ideologically extreme positions can, in fact, produce extremism’. Thus, trolling often has a self-feeding effect which “fans the flames”, creating increasingly polarised extremist positions among participants. Trolling proves problematic as it easily provokes a continual divergence away from seeking the common ground and towards incivility. Moreover, Papacharissi (2002: 22) notes how anonymity manifests as a lack of accountability. Even if trolling somehow avoided extremism and found the common ground, that common ground would lack legitimacy as one of the vertices of difference across which it is stretched is nonexistent in terms of accountability.
Nonetheless, Massaro and Stryker (2012: 419-20) argue that true anonymity on the Internet is a rare occurrence. Most websites require users to provide certain details to create accounts and users can often be identified according to the IP address assigned to them by their Internet Service Provider. Moreover, a combination of automatic moderation software and real-life moderators allows websites to monitor for comments that attempt to derail discussion or involve incivility (Massaro & Stryker 2012). Ruiz et al. (2011) also note that many news websites allow for user moderation where users can report comments that breach the terms of service. Ruiz et al. (2011) observe that news website terms of service generally seek to encourage intelligent debate and discussion via principles similar to those outlined by Peters (2008) for effective public sphere discourse. If the purpose of trolling comes from inciting a response, then moderation forces trolls to become more enterprising and creative so as not to be detected. It could come to a point where, for trolls to be successful, they must actually be part of civilised political debate. The fundamental difference is that trolls seek to play devil’s advocate rather than express authentic claims to common interest.
The lack of authenticity and accountability in the troll’s ersatz position can perhaps prove very problematic for a public sphere seeking to influence policy. If the trolls do not represent real claims, then there must be an identical authentic claim to fill the void left by the troll. However, this is a problem that perhaps applies equally to anonymous online discourse in general rather than being limited specifically to trolling. It is frequently argued that the Internet allows people to try on a number of fragmented virtual selves and liquid identities (Burkart 2010; Turkle 1995). Furthermore, Miller (2011: 177) points to Goffman’s (1959) dramaturgy metaphor to suggest that the self others see, regardless of whether it is online or offline, is always a performance that involves some sense of withholding a back-room identity. Miller (2011: 176-80) argues that the offline/online divide cannot provide a simple distinction of more and less authentic selves. In this regard, anonymous online performances can potentially be equally representative of a stranger’s true self as the impression given in an encounter with the same person on the street.
Massaro and Stryker (2012: 414-9) state that the available research is conflicted on whether Internet use increases homogeneity or heterogeneity of political views users are exposed to. They note that, in general, individuals find divergent viewpoints threatening. Rather than expend the extra cognitive energy required to process different views, individuals instead seek simplicity. However, this does not mean individuals necessarily avoid opposing views or disengage when exposed to them, only that they are unlikely to seek them out. Massaro and Stryker (2012: 415-9) contend that the wide range of ‘discourse options’ on the Internet provides greater opportunity for users to seek digital enclaves where their biases are reproduced. Nonetheless, this diversity also creates the possibility of inadvertent exposure to challenging viewpoints (Massaro and Stryker 2012: 419). Trolls add to this diversity by intentionally seeking to challenge other views.
It is possible that trolls could infiltrate homogeneous, populist-type groups and internally provoke them to examine dogmatic biases. Populism is an anti-elitist, exclusionary, “we the people” type politics which frequently lays blame for its troubles on weaker minority groups (Wells 1997; Berlet & Lyons 2000). Populism misappropriates free-speech rights to give value to its immediate “gut-feeling” opinions whilst dismissing any criticism as undemocratic and elitist (Wells 1997). Instead of using rational argument, populism uses loudness of voice to discount other views. Populism’s lack of accountability is dangerously undemocratic as it does not reciprocate equal consideration (Wells 1997). Thus, populism undermines public sphere legitimacy by refusing to follow the rules.
Through constructing an identity of seemingly sincere participation, the troll becomes part of the in-group. As a result, trolls could act in a way that provokes debate inside populist minded groups. That is, whilst trolling may not necessarily fulfil the requirements for effective public sphere discourse, it may still provide democratic potential through infiltrating other, politically influential, populist type groups who also refuse to follow the rules of public sphere discourse. By infiltrating the group, trolls avoid the “us and them” type problems where challenging views are dismissed as outsider elitism. Trolls could provoke populist groups to rationally legitimate the common interest of their own claims. By doing so, the claims are unwittingly transformed into following the discursive structure of the communicative public sphere. In a similar way, trolling could also subvert environments of political parallelism. However, trolling may also have the opposite effect whereby, as people become more aware of trolling, the label of “troll” could be used to dismiss authentic in-group dissent. Nonetheless, the most effective and successful trolls are those which avoid detection altogether. So the best trolls are cunning enough to avoid any relegation to out-group status.
Outlined here are some potential implications, both positive and negative, trolling has for Internet public sphere discourse. Research into this area is particularly scarce and much more is needed in order to draw more concrete conclusions. If my argument has been successful, what has been demonstrated here is that trolling should not be immediately dismissed as entirely uncivil and lacking any democratic potential. Rather, the very nature of seeking a response and initiating actual dialogue is something that is conspicuously absent on many parts of the Internet. This is definitely not to suggest that trolling is some saintly panacea, only that the mass-media focus on trolling as the undemocratic root of all evil is misguided and far overstated. Instead of scapegoating trolling behaviours, more energy should be expended on ways to promote rational public sphere discourse and strategies to deal with anonymously abusive Internet users.
References
Berlet, C. & Lyons, M.N. (2000), ‘Introduction’, in D. Kellner (ed.), Right-Wing Populism in America: Too Close For Comfort, New York, The Guilford Press: pp. 1-18.
Bohman, J. (2004), ‘Expanding Dialogue: The Internet, the public sphere and prospects for transnational democracy’, The Sociological Review, 52: 131-155.
Buchstein, H. (1997), ‘Bytes That Bite: The Internet and deliberative democracy’, Constellations, 4(2): 248-263.
Burkart, G. (2010), ‘When Privacy Goes Public: New media and the transformation of the culture of confession’, in H. Blatterer, P. Johnson and M. Markus (eds.), Modern Privacy: Shifting Boundaries, New Forms, New York, Palgrave MacMillan: pp. 23-38.
Dahlberg, L. (2001), ‘Computer-Mediated Communication and The Public Sphere: A critical analysis’, Journal of Computer-Mediated Communication, 7(1): n.p.
Delanty, G. (2007), ‘Public Sphere’, in G. Ritzer (ed.) The Blackwell Encyclopedia of Sociology, Massachusetts, Blackwell Publishing: pp. 3721-2.
Donath, J.S. (1999), ‘Identity and Deception in the Virtual Community’, in M.A. Smith and P. Kollock (eds.), Communities in Cyberspace, London, Routledge: pp. 29-59.
Goffman, E. (1959), ‘Performances’, The Presentation of Self in Everyday Life, New York, Anchor Books, Doubleday: pp. 22-30.
Hardaker, C. (2010), ‘Trolling in Asynchronous Computer-Mediated Communication: From user discussions to academic definitions’, Journal of Politeness Research, 6: 215-242.
Habermas, J. (1996), ‘Civil Society, Public Opinion and Communicative Power’, Between Facts and Norms: Towards a Discourse Theory of Law and Democracy, Cambridge Massachusetts, The MIT Press: pp. 359-387.
Herald Sun (2012), ‘Charlotte Dawson: How the cyber trolls beat me’, The Herald Sun, 3 Sept., at http://www.heraldsun.com.au/news/charlotte-dawson-how-the-cyber-trolls-beat-me/story-fnbk7kwa-1226463900647, accessed 10 Nov. 2012.
Hildebrand, J. & Matheson, M. (2012), ‘Twitter Makes Moves to Prevent Online Trolls’, The Daily Telegraph, 15 Sept., at http://www.dailytelegraph.com.au/news/sydney-news/twitter-moves-on-trolls/story-e6freuzi-1226474468650, accessed 10 Nov. 2012.
Jones, P.K. & Pusey, M. (2010), ‘Political Communication and ‘Media System’: The Australian canary’, Media, Culture & Society, 32(3): 451-471.
Massaro, T.M. & Stryker, R. (2012), ‘Freedom of Speech, Liberal Democracy, and Emerging Evidence on Civility and Effective Democratic Engagement’, Arizona Law Review, 54: 375-442.
Miller, D. (2011), ‘Fifteen Theses on What Facebook Might Be’, Tales from Facebook, Cambridge, Polity Press: pp. 164-204.
OED Online (2012a), ‘Flaming, n.’, Oxford English Dictionary Online, Oxford University Press, at http://www.oed.com/view/Entry/71033, accessed 10 Nov. 2012.
OED Online (2012b), ‘Troll, v.’, Oxford English Dictionary Online, Oxford University Press, at http://www.oed.com/view/Entry/206615, accessed 10 Nov. 2012.
Papacharissi, Z. (2002), ‘The Virtual Sphere: The Internet as a public sphere’, New Media & Society, 4(1): 9-27.
Peters, B. (2008), ‘The Meaning of the Public Sphere’, in H. Wessler (ed.), Public Deliberation and Public Culture: The Writings of Bernhard Peters 1993-2005, New York, Palgrave: pp. 36-42.
Rowland, D. (2006), ‘Griping, Bitching and Speaking Your Mind: Defamation and free expression on the Internet’, Penn State Law Review, 110: 519-538.
Ruiz, C., Domingo, D., Micó, J.L., Díaz-Noci, J., Meso, K. & Masip, P. (2011), ‘Public Sphere 2.0? The democratic qualities of citizen debates in online newspapers’, The International Journal of Press/Politics, 16(4): 463-487.
Simmel, G. (1971/1908), ‘The Stranger’, in D. N. Levine (ed.), Georg Simmel: On Individuality and Social Forms, Selected Writings, Chicago, The University of Chicago Press: pp. 143-149.
The Times (2012), ‘Celebrities Advocate Abuse of Trolls’, The Sunday Times UK, 17 Sept., at http://www.theaustralian.com.au/news/world/celebrities-advocate-abuse-of-trolls/story-fnb64oi6-1226475194355, accessed 10 Nov. 2012.
Tocqueville, A. de (1999/1840), Democracy in America, Volume II, Champaign Illinois, Project Gutenberg; Boulder Colorado, Net Library.
Turkle, S. (1995), ‘Aspects of the Self’, Life on the Screen: Identity in the Age of the Internet, New York, Simon & Schuster: pp. 177-209.
Wells, D. (1997), ‘One Nation and the Politics of Populism’, in G. Bligh (ed.), Pauline Hanson: One Nation and Australian politics, University of New England Press: pp. 18-28.
-
Why do Digital Humanities Projects often use Open Source Software Tools?
In The Magic Cauldron, Raymond uses rational choice theory to demonstrate why a business might find greater value in preferring open – rather than closed – source software implementations. These arguments for free open source software (FOSS) are built from Raymond’s previous discussion in The Cathedral and the Bazaar regarding how FOSS ideologies create an effective environment for software development. Many of the points raised by Raymond directly relate to why a researcher might prefer FOSS to solve domain problems in digital humanities projects.
A key point Raymond raises regards constructive laziness: considering that results trump effort, beginning from a partial solution is nearly always preferable to starting from scratch. As FOSS projects are free (as in speech, libre), they are readily available to be adapted and modified to any specific needs. Raymond observes that it is only through praxis that problems are truly understood: hence, it is likely that initial solutions will be inadequate and require modification or even complete rethinking. With proprietary software this means either being locked into a poor solution (because the code cannot be modified) or expending financial resources on new solutions. Free (as in beer, gratis) FOSS tools allow for rapid prototyping, thus rapid failures, and therefore rapidly better solutions without snowballing financial costs.
FOSS can also be used as a ‘strategic weapon’ (Raymond 2000). Free (libre) code means that similar projects are not forced to implement individual solutions in parallel, but instead can improve, modify and adapt existing project code, thus contributing back to the project. A development base built across a wide community allows for cost sharing. This also provides risk spreading by ensuring that even if the original developers leave, other developers can fill the gaps, thus preventing users from being left with projects reliant on orphaned software. This contrasts sharply with proprietary software where users are reliant on the company considering the software valuable enough to continue development and support. Furthermore, the free (gratis) nature of FOSS allows it to act as a market loss leader, preventing corporate price fixing and monopolistic control over how solutions should be implemented.
For Raymond, FOSS is particularly valuable when projects require independent peer review for verifying correctness of design and implementation (something all academic research requires). Free (libre) code translates to transparent code. The collaborative community aspects of FOSS combine with this transparency to increase accountability and replicability. Moreover, best practices arise more rapidly as other users see different approaches to problems, provide small suggestions and questions that lead to new ways of thinking, or see previously overlooked issues. Raymond (2000) summarises this effect as ‘many eyeballs tame complexity’.
FOSS is frequently criticised as having a free rider problem: why submit patches if doing so only benefits others? Firstly, as noted, academic research requires peer review: obfuscating methods only hinders this process. Moreover, beyond selecting appropriate approaches to research, it is rarely methods that make research valuable. Rather, it is how researchers analyse and interpret collected data so as to advance the existing knowledge in their field. Additionally, according to game theory, whilst intuitively it seems altruistic to submit code to FOSS projects, it is, in fact, optimally selfish. The cost has already been incurred in creating patches. By not submitting patches, the cost of maintaining patches for subsequent updates falls on the patch holder. Merging patches with source places that cost back onto the community of developers that maintain the project. Furthermore, submission of the patch fosters collaborative involvement amongst the community and therefore increases development.
Here I have outlined just a few of the many benefits that FOSS provides to academic researchers. However, despite these significant benefits, FOSS is not some utopian catch-all solution. As with all tools, it is important to justify selection against the available alternatives so as to make the best choice for the problem at hand.
Reference
Raymond, E.S. 2000, The Cathedral and the Bazaar, http://www.catb.org/esr/writings/cathedral-bazaar/