Admin

There are two courses in studium at the moment that are linked with this class
- Lingvistisk forskning och forskningsmetoder
- Lingvistisk forskning, samläsning
This is the first time the course has been run in studium and it is a bit unorganised
Lingvistisk forskning, samläsning is the correct course
But the people who enrol from outside linguistics B end up in the other course, so we’ve sent you an invitation to the samläsning course, and once you accept that I’ll take you out of the other course and you can just be in the samläsning course.
Sorry for the confusion!

Introductions

Name
Native language + other languages you speak
Favourite area(s) of linguistics (e.g. Phonetics, Phonology, Morphology, Syntax, Semantics, Pragmatics, Sociolinguistics, Historical Linguistics, Corpus Linguistics, Forensic Linguistics etc.)
Operating system (Windows/Mac/Linux)

About the course

These are the main topics we will cover:

Formulating good research questions (today)
Best practices for reproducible research (also today)
Data wrangling and visualisation in R/RStudio (the bulk of the course)
Carrying out linguistic research - we will carry out our own linguistic experiment (about cake) and you will need to analyse the data for your final assignment

We will also cover some basic statistical concepts and statistical tests, but this is not the focus of the course.

If you want to know more about statistics

I can recommend these textbooks, but they are by no means required!

Levshina, Natalia. 2015. How to do linguistics with R: Data exploration and statistical analysis. John Benjamins Publishing Company.

Winter, Bodo. 2020. Statistics for linguists: an introduction using R. Routledge

But I also recommend you to enrol in a dedicated statistics course at some point if you really want to understand statistical concepts in depth.

Prerequisites

None!

I will be teaching you how to do everything from the very beginning in class, so don’t stress!

Assessment

Weekly assignments (starting next week)
A final report (about the experiment that we will run together)
No exam!
All assessment is handed in online via studium.
I will weight it so that the earlier assignments are worth less than later assignments, so don’t worry if you find them difficult at first.
Final grades are fail/pass/pass with distinction

Schedule

Courses are on Thursdays from 14:15 to 16:00
Except in week 19, because Thursday is a public holiday, we are scheduled to have class on Wednesday from 12:15 to 14:00. That’s Wednesday the 12th of May.
This is a kind of annoying time slot, if no one has anything else on from 14:15 to 16:00 on Wednesdays we could just use the same time slot as usual?
The final assignment will be due Friday the 4th of June.
You can see the full schedule with the dates for all the assignments in the Syllabus section on studium.

Doing linguistic research

Linguistics is a very broad discipline, but there are principles and best practices for doing linguistic research (or any research) that apply to all of us, no matter what field.
In this first lecture, I’m going to go over these principles that you should have in your head before you even begin a research project.
The rest of the course will be about ways of handling and analysing linguistic data in R.

Quantitative and Qualitative research

Quantitative research: research that involves counting or measuring something.
Qualitative research: research that involves describing something.
We will be focusing on quantitative research during this course, as this is the kind of data that is best suited to analysis with R. However, the principles for doing linguistic research that I will discuss today apply equally to both qualitative and quantitative data.
In reality, most linguistic research is likely to involve both quantitative and qualitative elements. Language is not the kind of process that can be captured by numbers alone.
However, where we do have linguistic data that we can measure, this can be really useful for answering certain kinds of linguistic questions, provided we use the right methods!

How to do quantitative research

## Warning: package 'DiagrammeR' was built under R version 4.2.2

A good study should have all these elements, they should all link together, and you should think about all of them before you start your project!

Checklist for good research questions

Make sure your research questions are:

grounded in observation/previous research ✔
entail (specific) hypotheses ✔
these hypotheses should be testable/measurable ✔
and the results amenable to appropriate statistical analyses ✔

Example: phonetics

Kluender et al. 1988, Vowel-length differences before voiced and voiceless consonants: an auditory explanation.
Observation: “A widely attested finding, approximating a true phonetic universal, is that voiced stops tend to be preceded by longer vowels than voiceless stops”
Question: Why?

VOT distinguishes voiced versus voiceless stops

VOT stands for ‘voice onset time’. It refers to the length of time between the release of the stop and the onset of voicing (=vibration of the vocal folds). Voiced stops (e.g. /b/) have a short VOT, and voiceless stops (e.g. /p/) have a long VOT. VOT is the main thing that determines whether a stop is perceived as voiced or voiceless.

Hypothesis

People vary the length of their vowels before voiced/voiceless consonants to make this voicing contrast easier to perceive: a long vowel should make a short VOT seem even shorter by comparison (hence making the consonant sound more voiced), while a short vowel should make a long VOT appear longer by comparison (hence making the consonant sound more voiceless).

Measurements

Recorded someone saying /apa/, then edited that sound file creating multiple versions with different VOT for the consonant, and different durations for the initial vowel.
Asked people to listen to these different sound files and record whether they heard /apa/ or /aba/.

Analysis

As expected, VOT predicts whether people hear a voiced or voiceless stop
But for ambiguous cases, the length of the preceding vowel can also play a role

Research feeds research

How do we keep this cycle going?

With FAIR data!

Why?

(Unselfishly) It helps advance the field
(Selfishly) citations!

Making your data findable and accessible: data repositories

Data can be made more FAIR by submitting it to specialised data repositories.
These are organisations, usually non-commercial, whose sole job is securely and reliably storing scientific data and making it available over the internet.
There are lots of discipline-specific repositories, but a popular repository for linguistics is OSF (Open Science Framework).
They store your data and keep it safe for you for free! While also making it findable, accessible and citable :)

Tips for making your data findable

Provide a short description
Use tags! (called ‘topics’ on Github)
If you’re not sure what tags to use, take a look at other OSF projects/github repositories from people in your field.
Example OSF project

Data licensing

It’s important that people who can access your data know exactly what they’re allowed to do with it.
This is often achieved by means of selecting a copyright license for the data.
Creative Commons licenses are a popular choice for datasets.
They basically state that people can do whatever they want with your data, as long as they cite you.
You can read more about the specifics of the different creative commons licenses here

Github

OSF is really good for storing datasets, but code is usually stored on Github.
Github is a free service for storing code online
It is also a version control system

What is version control?

This is what version control looks like at its worst.
This is what it looks like at its best, with git.

Linking github and OSF

Github and OSF repositories can be linked! We will see how to do this later in the course.
Github isn’t good at handling large files, so its better to keep big datasets on OSF
But OSF isn’t as good as git for version control, so things like code that you are likely to change a lot you should keep on github

Making your data interoperable and reusable

It’s great if your data is findable and accessible (through OSF/Github), but this is kind of useless if its in a format that nobody can work with.
Plain text file formats are good, .pdf and .xlsx are bad!
.pdf is just completely useless, you can’t get the data out and do anything with it, you can only look at it.
.xlsx is hard for programming languages (e.g. R, Python) to read, you usually have to convert .xlsx documents into plain text files before you can really work with them, so it’s nice if you just convert them beforehand for people when you share the data.
Excel files are not reliable/secure in the long-term
- .xlsx can only be opened by excel software; it’s possible that in the future this software will become outdated/not exist
- whatever spreadsheet software eventually replaces excel may not be able to open .xlsx files
- even different versions of excel process data differently
Plain text files are much safer because they are just that–plain text–they don’t require any special software to be read correctly; they can be read by anyone (human or computer)

This doesn’t mean you can never use excel again

Excel is very convenient when you are making a database and entering data (you don’t have to worry about creating columns and you can easily see where those columns are; plain text is not readable compared to excel).
Just don’t save/share your data in .xslx, use a plain text format!
Excel can also save in plain text format, and read plain text files (we will practise doing this later)
While you are making the database, use excel, but save it in a plain text file for when the time comes to use/analyse the data (in R)

Plain text formats

There are two main plain text formats used for storing spreadsheet-style data: .csv and .tsv
.csv stands for comma separated values
.tsv stands for tab separated values
they’re not very pretty/readable, but they can be easily read into any computer program and you can look at the data in there
this simple way of storing data also makes it faster for computers to process (which is good when we are doing a lot of data wrangling in R)

When to use csv and when tsv?

tsv is necessary when you have commas in some of your fields (e.g. if you have a field with example sentences)–csv won’t work here!
tsv is also safest (you don’t run into problems with delimiters)
but csv can be more convenient to work with when you are editing data a lot because most software (like excel) will know how to read it and open it up immediately, whereas with tsv you sometimes have to tell the software how to read it/import it
because of this, csv is still more popular (even though tsv is technically better)
whatever you choose, csv or tsv will always be a LOT better than .xlsx or .pdf and people will thank you for using these formats!

Text encodings

The other thing you have to think about when sharing your data is the encoding
Often, when you save your data from a mac or PC, it saves the text in an encoding that is only readable by macs, or only readable by PCs, or only readable by computers from a certain country, etc.
When someone with a different computer to you goes to open that file, this is what they see:

It’s very stressful/scary!

Use UTF-8!

UTF-8 is a very widely supported text encoding which is capable of accurately and reliably saving characters from just about every writing system on Earth plus specialist characters like IPA symbols and mathematical notation.
Basically every computer in every country can read it.
We will learn how to save files in UTF-8 encoding later in the course (it’s not very hard)

Making your data interoperable and reusable: include a ReadMe!

ReadMes are very important, don’t forget one!
In the ReadMe you should clearly explain where the data came from (who were the speakers? what was the dictionary/wordlist?), how it was collected (what did you ask/show the speakers to get those words? did you use sentence frames?), what the columns are in your data (explain acronymys/shortforms in column names, if there’s a column called freq, frequency where?), any transcription systems used (if not IPA), etc.
Your ReadMe documents should also be in plain text format!
We will be learning a really great plain text format called markdown in the next lecture
These slides are written in markdown, and R notebooks also use markdown
Even if civilisation collapses, you will still be able to read these slides because they were written in markdown, which is plain text!
Example of a ReadMe (written in markdown)

Reproducibility

A study is reproducible if you can take the original data and the computer code used to analyze the data and reproduce all of the numerical findings from the study.
FAIR data leads to reproducibility, which is like the ‘gold standard’ of ‘good research practice’.

Why make your study reproducible?

Reproducible studies are more ‘trustworthy’. Having the data and code makes it easier for people (including reviewers/supervisors/colleagues etc.) to check the analyses you did.
You shouldn’t be afraid of this; science is a collaborative effort, and everybody makes mistakes (even fancy professors). People just want to be able to look out for you and help you so that you can do your best research.
Doing the things you’ll learn in this course (FAIR data, version control, etc.) will make this easier for them.

Why make your study reproducible?

It also allows people to re-analyse your data (perhaps asking different questions), or, to run the exact same analysis on their own data.
This helps advance the field, while also getting you more citations :)

How can R help?

This is where R comes in.
It is very difficult (basically impossible) to publish a reproducible study without coding your analyses.
But there are many more reasons to use R in your research as well

Benefits of R

It makes it easier to identify (and fix) mistakes in your analysis. Very often, you don’t realise there is a problem with how you’ve analysed something until a while after you’ve done it. When your analysis is coded, this is easy to go back and fix. But if you just did it using some software (e.g. Excel, SPSS), there is no record of what you did and you are likely to forget, so the mistake goes uncorrected.
It reduces your workload when writing up papers!
- R supports Markdown, which is a way of creating formatted text (e.g. for a paper) using a plain text editor.
- These slides are created using Markdown in R.
- Being able to write your paper in R (using Markdown) means that you can generate all your figures etc. in the paper directly from your code, so if you ever need to change anything in the figure, you can quickly update the code and the figure is also updated in your paper. No more needing to save your figures and insert them into word documents!
- R Markdown allows you to create output as a html (web) document (which also means you can have interactive plots), as a slidy presentation (what you are seeing now), or as a word document (good for sharing with your supervisors for feedback) or PDF. You can even write whole books using R and Markdown.
- There are many more kinds of output too (e.g. websites, handouts, dashboards), see here for examples.
It is also integrated with Git and Github, so working in R makes version control super easy, leading to way less stress later on!

Homework

Install R and RStudio, then download the R Project for this class. We will start working in this project from next week .

Install R from the https://cran.rstudio.com/
Install RStudio desktop version for your platform (Windows, Mac, Linux) from https://www.rstudio.com/products/rstudio/download/#download (or search for “RStudio download”)
Download and unzip the ResearchInLinguistics zip file from the R Project section on the modules page in Studiuum

If you need help, I have virtual office hours on Mondays from 14 to 16. Email me to set up a zoom meeting. If those times don’t work for you, you can also email me to set up a meeting at another time.
You can also post questions to the Discussions section of the course in studium and I will answer them there.
If for some reason you still can’t get RStudio working on your computer, I have set up an Emergency R App where you should be able to run all the R code that we use in class, so that you can still follow along during the lectures. However, you should make an appointment with me to get RStudio installed on your computer as you will need it to do the homework assignments and the final assignments.

I will also upload these slides and any other materials from our lectures to the relevant module for each lecture in the Modules section on Studium after each class.

One more buzzword: replicability

Replicability is the act of repeating an entire study, independently of the original investigator without the use of original data (but generally using the same methods).
You may have heard of the Replication Crisis in psychology.
In 2015, the journal Science published the results of an attempt to replicate 100 experimental and correlational psychology studies that had been published in three prominent psychology journals. Despite the care taken to reproduce the experiments exactly, more than half of the studies failed to replicate. When the effects were replicated, they tended to be smaller or weaker than those of the original study.
This is not just a problem for pyschology, but is widespread in science generally (including linguistics).
It is often attributed to a combination of publication bias and low-power research designs. Publications favor flashy, positive results, making it more likely that studies with larger-than-life effect sizes are chosen for publication.
Replication studies are now becoming increasingly popular and encouraged.
This doesn’t necessarily mean just doing the exact same thing; you can ‘challenge/test’ the results of the original study by, for e.g., using larger sample sizes, a different demographic of participants, introducing more control measures/randomisation, etc. (we will talk more about good experiment design in a later lecture)

Doing linguistic research

Admin

Introductions

About the course

If you want to know more about statistics

Prerequisites

Assessment

Schedule

Questions?

Doing linguistic research

Quantitative and Qualitative research

How to do quantitative research

Checklist for good research questions

Example: phonetics

VOT distinguishes voiced versus voiceless stops

Hypothesis

Measurements

Analysis

Research feeds research

How do we keep this cycle going?

Making your data findable and accessible: data repositories

Tips for making your data findable

Data licensing

Github

What is version control?

Linking github and OSF

Making your data interoperable and reusable

This doesn’t mean you can never use excel again

Plain text formats

When to use csv and when tsv?

Text encodings

Use UTF-8!

Making your data interoperable and reusable: include a ReadMe!

Reproducibility

Why make your study reproducible?

Why make your study reproducible?

How can R help?

Benefits of R

Homework

One more buzzword: replicability