Analysing Social Science Data Using R

This is the webpage of the course “Analysing Social Science Data Using R”, held at SNS in 2012.

Links to slides, data files, and R scripts will be available under each meeting section below.

A short description of the course

Data analysis is fundamental to making valid inferences from research in sociology and other areas of the social sciences. The statistical and data-analytic methods and tools one chooses are determined by the substantive questions motivating the research, but the ability to apply them to the collected data depends, at least in part, on the availability of software that implements them. While proprietary software such as SPSS or Stata is in common use in the social sciences, this course is designed to teach participants how to perform statistical analyses using R. R (http://www.r-project.org) is free and open-source software for statistical computing and data visualization. It is considered a “lingua franca” of data analysis among academics and professionals in many disciplines, and it is growing in popularity among social scientists and in commercial environments. R is widely recognised for its power, its unsurpassed data visualization capabilities, and its flexibility, that is, the ability to implement virtually any statistical method or model.

The main objective of the course is to train participants in using R for a typical analysis of data in sociology (or other social science disciplines), for example: basic manipulation of data (recoding, variable transformation), computing descriptive statistics, visualizing data, and estimating regression models.

The course focuses on how to use R rather than on teaching statistical methods. Consequently, the primary requirement for participants is familiarity with the basic statistical concepts commonly used in sociology and other social sciences. The course will nevertheless briefly review some of these topics where necessary.

The main way of interacting with R is by typing commands (not unlike using SPSS syntax) and writing scripts, usually short, in the R language. Consequently, the secondary requirement is being fairly comfortable with using a computer and being ready to learn R syntax.

Course details and grading

Course meetings will be held weekly with two academic hours per meeting. You are expected to attend all classes and to be prepared each day. Because your participation is a vital part of the work of the class, it is considered unprofessional to miss class and/or not to be prepared. Class participation will constitute 20% of your final grade.

During the course, you will be asked to work out several in-class exercises. You will also be given a few home assignments. The exercises and assignments constitute 40% of your final grade.

At the end of the semester, you will receive a problem to be solved using data-analytic methods and tools you have learnt during the course. This latter problem is equivalent to a course exam and will constitute the remaining 40% of your grade.

Important note on exercises, assignments, and exam

A missed exercise or assignment will be graded as a ZERO and no make-up will be scheduled unless: (1) prior arrangements have been approved by the Instructor, or (2) the absence is covered by a WRITTEN VALID EXCUSE.

Reading materials

The course is designed to be rather self-contained. However, should a need arise, the participants will be pointed to relevant chapters of:

  • Fox, J. and Weisberg, S. 2011. An R Companion to Applied Regression, 2nd Edition. Sage
  • Muenchen, R. 2009. R for SAS and SPSS Users. Springer
  • Agresti, A. and Finlay, B. Statistical Methods for the Social Sciences. Prentice Hall

The first two books are available in the library. Excerpts from the third will be provided.

Along with the books, we will make use of various manuals and articles available on the Web.

Data files

File            Description
pgss1999in.tab  Data from PGSS and ISSP 1999 containing a battery of questions on estimated and “fair” incomes for several social/occupational categories, plus some demographic variables.
pgss1.sav       Data from PGSS editions 1999 and 2008 (selected variables).

Course schedule

Class 1 (13.02)

Introduction and organizational matters. Course objectives. Workflow. Grading. What is R? Downloading and installation. CRAN and contributed packages. RStudio (a user-friendly interface to R).

Introduction slides [PDF]

Class 2 (20.02)

Using R as an advanced calculator. Basic functions and operators. Using the R help system. Creating simple objects. Workspace.
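As an illustration of the topics listed for this class, here is a minimal sketch of calculator-style use, object creation, and workspace handling. The object names (ages, mean_age) and the values are hypothetical and are not taken from the course script.

```r
# Arithmetic works as on a calculator
2 + 3 * 4        # multiplication binds tighter than addition: 14
sqrt(16)         # a built-in function: 4
10 %/% 3         # integer division: 3
10 %% 3          # remainder: 1

# Store values in named objects with the assignment operator
ages <- c(23, 35, 41, 58)   # hypothetical values
mean_age <- mean(ages)

# Help pages (uncomment in an interactive session)
# ?mean
# help.search("regression")

# Inspect and tidy the workspace
ls()             # names of the objects currently in the workspace
rm(mean_age)     # remove a single object
```

In an interactive session the same commands can be typed at the console one by one; collecting them in a script makes the analysis repeatable.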

Basics [R script]

Class 3 (5.03)

Importing data into R from plain text files, SPSS files, and Excel files. Working directory. Data frames: $, attach/detach, with, subset. Basic descriptive statistics: computing means, variances, standard deviations, medians, quantiles, etc. Handling missing data.
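The following sketch shows one possible way to import a tab-separated text file and compute basic descriptive statistics with missing values present. The file toy_survey.tab and its contents are hypothetical and are created on the fly so that the example is self-contained; the course data files are not used here.

```r
# Write a small hypothetical tab-separated file to a temporary folder
tab_file <- file.path(tempdir(), "toy_survey.tab")
writeLines(c("id\tage\tincome",
             "1\t34\t2500",
             "2\t51\tNA",
             "3\t28\t1800",
             "4\t45\t3200"), tab_file)

# Import it; header = TRUE takes variable names from the first row
survey <- read.table(tab_file, header = TRUE, sep = "\t")
str(survey)

# SPSS files can be read with the 'foreign' package, for example:
# foreign::read.spss("pgss1.sav", to.data.frame = TRUE)

# Accessing variables in a data frame
survey$age
with(survey, mean(age))
subset(survey, age > 30, select = c(id, income))

# Descriptive statistics; na.rm = TRUE drops missing values first
mean(survey$income)                 # NA, because one income is missing
mean(survey$income, na.rm = TRUE)
median(survey$age)
var(survey$age)
sd(survey$age)
quantile(survey$age, probs = c(0.25, 0.5, 0.75))
sum(is.na(survey$income))           # number of missing incomes
```

Using with or subset avoids attach/detach, which can leave stale copies of variables on the search path.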

Class 4 (19.03)

Basic data visualization: creating bar plots, pie charts, histograms, and scatterplots. Exporting data from R using R’s native data format with save and load.
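A minimal sketch of the four basic plot types and of saving and reloading an object in R's native format. The data are simulated and purely hypothetical; the plots are written to a temporary PDF file only so that the example also runs non-interactively.

```r
set.seed(1)
# Simulated, hypothetical data
educ   <- factor(sample(c("primary", "secondary", "tertiary"), 100, replace = TRUE))
age    <- round(runif(100, 18, 80))
income <- 1000 + 20 * age + rnorm(100, sd = 300)

# Send plots to a PDF file (omit pdf() and dev.off() to plot on screen)
plot_file <- file.path(tempdir(), "basic_plots.pdf")
pdf(plot_file)
counts <- table(educ)
barplot(counts, main = "Education", ylab = "Frequency")
pie(counts, main = "Education")
hist(age, main = "Age", xlab = "Age in years")
plot(age, income, main = "Income by age")
dev.off()

# Save a data frame in R's native format and load it back
toy <- data.frame(educ, age, income)
data_file <- file.path(tempdir(), "toy.RData")
save(toy, file = data_file)
rm(toy)
load(data_file)   # restores the object under its original name, 'toy'
```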

Class 5 (26.03)

Recap of vectors and recoding.

  • Selecting elements from vectors using []
  • Replacing values using [] and replace
  • Useful functions in recoding: %in%, which.
  • Categorizing continuous variables.
  • Creating ranks.

Factors as a special type of vector for representing categorical variables.
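The recoding tools listed above can be sketched as follows. The vectors age and region are hypothetical, and the cut points for the age groups are arbitrary choices made for illustration.

```r
age    <- c(19, 34, 67, 45, 72, 28)          # hypothetical values
region <- c("N", "S", "E", "N", "W", "S")

# Selecting elements with []
age[2]            # by position
age[age > 40]     # by logical condition
age[-1]           # everything except the first element

# Replacing values: two equivalent approaches
age2 <- age
age2[age2 > 70] <- NA
age3 <- replace(age, age > 70, NA)

# Helpers that are useful in recoding
which(age > 40)                          # positions where the condition holds
north_south <- region %in% c("N", "S")   # TRUE where region is N or S

# Categorizing a continuous variable; intervals are closed on the right
age_group <- cut(age, breaks = c(0, 29, 59, Inf),
                 labels = c("18-29", "30-59", "60+"))

# Ranks
rank(age)

# A factor with an explicit level order
region_f <- factor(region, levels = c("N", "S", "E", "W"))
levels(region_f)
table(region_f)
```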

Files: Slides, R script, Solution to in-class exercise, Solution to homework

Class 6 (2.04)

Creating frequency tables and cross-tabulations. Creating functions. Performing Chi-square tests.

  • Creating cross-tabulations with table, flattening with ftable. Making tables of proportions with prop.table.
  • Creating functions.
  • Matrices and arrays: creating and subsetting. Manipulating matrices and arrays with apply, cbind, and rbind.
  • Performing Chi-square tests with chisq.test.
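A short sketch tying these tools together, using simulated and purely hypothetical variables (sex, educ, voted). The helper row_pct is an illustrative function written for this example, not part of R or of the course materials.

```r
set.seed(2)
n     <- 200
sex   <- sample(c("female", "male"), n, replace = TRUE)
educ  <- sample(c("primary", "secondary", "tertiary"), n, replace = TRUE)
voted <- sample(c("no", "yes"), n, replace = TRUE)

# Cross-tabulation and proportions
tab <- table(sex, educ)
prop.table(tab)        # cell proportions
prop.table(tab, 1)     # row proportions
ftable(table(sex, educ, voted))   # flattened three-way table

# A user-defined function: row percentages, rounded
row_pct <- function(tab, digits = 1) {
  round(100 * prop.table(tab, margin = 1), digits)
}
row_pct(tab)

# Matrices: creating, subsetting, and manipulating
m <- matrix(1:6, nrow = 2)
m[1, ]                 # first row
m[, 2]                 # second column
apply(m, 1, sum)       # row sums
apply(m, 2, sum)       # column sums
cbind(m, total = apply(m, 1, sum))
rbind(m, apply(m, 2, sum))

# Chi-square test of independence
test <- chisq.test(tab)
test$statistic
test$p.value
```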

Files: Slides, R script, Data file, Homework, Answers

Class 7 (16.04)

Computing conditional descriptive statistics, data aggregation, and merging. Exporting data.

  • Computing conditional descriptive statistics using tapply.
  • Data aggregation with aggregate.
  • Exporting data to text files and SPSS files.
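The three steps above can be sketched as follows, assuming a small simulated data frame (toy) with hypothetical variables. Export to SPSS is only indicated in a comment, since it relies on the foreign package.

```r
set.seed(3)
toy <- data.frame(
  region = rep(c("north", "south"), each = 6),
  sex    = rep(c("female", "male"), times = 6),
  income = round(rnorm(12, mean = 3000, sd = 500))   # hypothetical incomes
)

# Conditional means with tapply: by one factor, then by two
tapply(toy$income, toy$region, mean)
tapply(toy$income, list(toy$region, toy$sex), mean)

# The same idea with aggregate, which returns a data frame
agg <- aggregate(income ~ region + sex, data = toy, FUN = mean)
agg

# Export to a tab-separated text file
out_file <- file.path(tempdir(), "toy_aggregated.txt")
write.table(agg, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# For SPSS, foreign::write.foreign(df, datafile, codefile, package = "SPSS")
# writes a text data file plus an SPSS syntax file that reads it.
```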

Optional topics (if time permits):

  • Merging datasets with merge.
  • Data reshaping with melt and cast from the “reshape” package.
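A minimal sketch of merge with two hypothetical data frames (people and regions; the unemployment figures are made up). Reshaping with melt and cast is not shown because it requires the contributed “reshape” package to be installed.

```r
# Individual-level data and region-level data sharing the key 'region'
people  <- data.frame(id = 1:4,
                      region = c("north", "south", "south", "east"))
regions <- data.frame(region = c("north", "south", "west"),
                      unemployment = c(8.1, 12.4, 10.0))   # made-up values

# Keep only rows with a match in both data frames
inner <- merge(people, regions, by = "region")

# Keep every row of 'people'; unmatched regions get NA
left <- merge(people, regions, by = "region", all.x = TRUE)

# Keep all rows from both data frames
full <- merge(people, regions, by = "region", all = TRUE)
```

The all, all.x, and all.y arguments control which unmatched rows are retained, so it is worth checking the number of rows after every merge.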

Files: Slides, Script, Solutions and maps

Class 8 (23.04)

Advanced data visualization with R and creating custom plots.

  • Visualizing bivariate, trivariate, and multivariate data.
    • Categorical by continuous variable relationship with cdplot.
    • Visualizing several data series with matplot.
    • Visualizing pairwise relationships among multiple variables with a scatterplot matrix.
    • Conditioning plots for three and four variables with coplot.
  • Creating custom plots
    • Using points, lines, axis, and legend
    • Color specification.
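The plotting functions named above can be tried out on simulated data as in the sketch below. All variables are hypothetical, pairs is assumed here as the base R function for a scatterplot matrix, and the output goes to a temporary PDF file so that the example also runs non-interactively.

```r
set.seed(4)
n <- 150
age        <- round(runif(n, 18, 80))
educ_years <- round(runif(n, 8, 20))
income     <- 500 + 25 * age + 120 * educ_years + rnorm(n, sd = 400)
voted      <- factor(ifelse(runif(n) < plogis(-2 + 0.04 * age), "yes", "no"))
sex        <- factor(sample(c("female", "male"), n, replace = TRUE))
toy <- data.frame(age, educ_years, income, voted, sex)

plot_file <- file.path(tempdir(), "advanced_plots.pdf")
pdf(plot_file)

# Categorical by continuous: conditional density plot
cdplot(voted ~ age, data = toy)

# Several data series in one plot
series <- cbind(a = (1:10)^1.5, b = 2 * (1:10))
matplot(1:10, series, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "Time", ylab = "Value")

# Scatterplot matrix and conditioning plot
pairs(toy[, c("age", "educ_years", "income")])
coplot(income ~ age | sex, data = toy)

# A custom plot assembled from low-level functions
is_f <- toy$sex == "female"
blue_half <- rgb(0, 0, 1, alpha = 0.5)      # semi-transparent blue
plot(toy$age, toy$income, type = "n", axes = FALSE,
     xlab = "Age", ylab = "Income")
points(toy$age[is_f],  toy$income[is_f],  col = "darkorange", pch = 16)
points(toy$age[!is_f], toy$income[!is_f], col = blue_half,    pch = 17)
lines(lowess(toy$age, toy$income), lwd = 2)  # smoothed trend
axis(1, at = seq(20, 80, by = 20))
axis(2)
box()
legend("topleft", legend = c("female", "male"),
       col = c("darkorange", blue_half), pch = c(16, 17))

dev.off()
```

Colors can be given by name ("darkorange"), by rgb components, or as hexadecimal strings such as "#FF8C00".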

Files: Slides, Homework, R code used in examples in slides

Class 9 (07.05)

Linear regression in R (part 1).

  • Model specification and estimation
  • Summarizing the results with summary and anova
  • Visualization using model-predicted values with predict.
  • Using categorical predictors. Contrasts.
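A minimal sketch of this workflow on simulated data. The variables and the coefficients used to generate them are hypothetical and chosen only for illustration.

```r
set.seed(5)
n <- 200
educ_years <- round(runif(n, 8, 20))
sex        <- factor(sample(c("female", "male"), n, replace = TRUE))
income     <- 1000 + 150 * educ_years + 300 * (sex == "male") + rnorm(n, sd = 500)
toy <- data.frame(income, educ_years, sex)

# Model specification and estimation
fit1 <- lm(income ~ educ_years, data = toy)
fit2 <- lm(income ~ educ_years + sex, data = toy)

# Summaries and a comparison of the nested models
summary(fit2)
anova(fit1, fit2)

# Model-predicted values for women across the range of education
new_data <- data.frame(educ_years = 8:20,
                       sex = factor("female", levels = levels(toy$sex)))
new_data$predicted <- predict(fit2, newdata = new_data)

# Plot the data and add the predicted line
plot_file <- file.path(tempdir(), "lm_predictions.pdf")
pdf(plot_file)
plot(toy$educ_years, toy$income, xlab = "Years of education", ylab = "Income")
lines(new_data$educ_years, new_data$predicted, lwd = 2)
dev.off()

# Categorical predictors: inspect the contrasts, change the reference level
contrasts(toy$sex)
toy$sex2 <- relevel(toy$sex, ref = "male")
```

With the default treatment contrasts, the coefficient sexmale is the expected difference between men and women at a given level of education.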

Files: Slides, Script, Data, Homework, Answers to the homework

Class 10 (14.05)

Linear regression in R (part 2): using interaction effects.
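As an illustration, an interaction between a continuous and a categorical predictor can be specified with the * operator in the model formula. The data below are simulated and hypothetical, and the model is a sketch rather than the one used in class.

```r
set.seed(6)
n <- 200
educ_years <- round(runif(n, 8, 20))
sex        <- factor(sample(c("female", "male"), n, replace = TRUE))
# Simulated so that the education slope is steeper for men
income <- 1000 + 100 * educ_years + 80 * educ_years * (sex == "male") +
  rnorm(n, sd = 500)
toy <- data.frame(income, educ_years, sex)

# Additive model versus model with an interaction
fit_add <- lm(income ~ educ_years + sex, data = toy)
fit_int <- lm(income ~ educ_years * sex, data = toy)  # main effects + interaction

coef(fit_int)
anova(fit_add, fit_int)   # F test for the interaction term

# Education slopes implied by the interaction model
slope_female <- coef(fit_int)["educ_years"]
slope_male   <- coef(fit_int)["educ_years"] + coef(fit_int)["educ_years:sexmale"]
```

The formula educ_years * sex is shorthand for educ_years + sex + educ_years:sex.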

Files: Slides, Script, Data

Class 11 (21.05)

Generalized linear models: Binary logistic regression.
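A minimal sketch of a binary logistic regression fitted with glm. The outcome voted and the predictor age are simulated, hypothetical variables.

```r
set.seed(7)
n <- 300
age   <- round(runif(n, 18, 80))
voted <- rbinom(n, size = 1, prob = plogis(-2 + 0.05 * age))  # 0/1 outcome
toy <- data.frame(voted, age)

# family = binomial uses the logit link by default
fit <- glm(voted ~ age, data = toy, family = binomial)
summary(fit)

# Coefficients are on the log-odds scale; exponentiate for odds ratios
odds_ratios <- exp(coef(fit))

# Predicted probabilities for selected ages
new_data <- data.frame(age = c(25, 50, 75))
probs <- predict(fit, newdata = new_data, type = "response")
```

Without type = "response", predict returns values on the log-odds (link) scale.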

Discussion of course topics that have not yet been covered in sufficient depth, ahead of the upcoming test.

Files: data, slides, script, discussion notes.

Class 12 (28.05)