August 15, 2012

Statistics and R for Linguists - a reading list

Linguists and students of linguistics can find themselves in tough situations, since statistics and computing are usually not an integral part of their education, or they only get a crash course on these subjects. I think R is a good choice for self-study because of the high-quality books and other resources available.

The basics
  • Verzani: simpleR - a short, free and very beginner-friendly intro to R and stats. The author also wrote Using R for Introductory Statistics, which is a more elaborate version of simpleR.
  • Udacity Intro to Statistics (ST101) - Sebastian Thrun's free Udacity course is a general introduction to statistics. It is very nice and you can work through it at your own pace. It contains optional material on statistical programming with Python.
  • Coursera: Statistics One - Andrew Conway's Princeton course is now on Coursera. Just like the Udacity course, it is a general intro to stats, so it doesn't deal with linguistic applications, but it also provides an intro to R.
  • Coursera: Computing for Data Analysis - Roger D. Peng's introductory course at Johns Hopkins has been adapted for Coursera. According to the course description, it focuses on using R for basic tasks.
Linguistics and statistics

Lx and R
  • Gries: Quantitative Corpus Linguistics with R - Gries' book is well written and assumes no prior knowledge of stats, R or corpus linguistics. I love this title as it simply does its job and teaches the basics of manipulating and analyzing linguistic data with R.
  • Gries: Statistics for Linguistics - a general intro to stats and R for linguists. It is very readable and I think it is perfect for self-study.
  • Baayen: Analyzing Linguistic Data - this book can be a tremendous resource for upper-level undergrads and graduate students. It assumes that you are not new to linguistics and shows you how to use R in your investigations. However, it goes through its topics very fast and sometimes gives very limited explanations of the R code fragments. I don't think this title is an ideal first book on statistical methods and R.
  • Johnson: Quantitative Methods in Linguistics - this book is about quantitative methods and uses data sets drawn from various fields of linguistics. It is the most advanced book in this list, and it is more about the methods than about programming in R, so don't pick it up as your first book on this subject. Later, when you have some background in stats and R, you can learn a lot from this title.
  • Vasishth - Broe: The Foundations of Statistics - this book uses a simulation-based approach to teach statistics (and R). I think it is targeted at psycholinguists, but others can also find it useful, as it assumes no knowledge of stats or R.
  • Levy: Probabilistic Models in the Study of Language - a work-in-progress textbook. It is free, but the author hasn't finished it yet. As the intended audience is graduate students of linguistics and cognate disciplines, it is not a gentle introduction.
Advanced materials
The above-mentioned books are pretty good, and if you work hard on them you can make considerable progress. At some point, you'll realize that they are good at teaching you how to analyze data, but tell you nothing about managing your data and organizing a data analysis project. As your data analysis projects get more complex, you have to learn about these topics.
  • git and github - version control helps you to track your work and revert to a previous state if something goes wrong
  • ProjectTemplate - helps you to organize your scripts and data
  • RUnit - unit testing is a good software engineering practice that helps you to write more reliable code
  • Phil Spector: Data Manipulation in R - this short book gives you tips and tricks that help you to use R for manipulating your data
  • John Chambers: Software for Data Analysis - an advanced book on R that helps you to put the pieces together and become an advanced R user
