Transcription

This page explains the most important aspects of the transcription theory employed in creating the corpus text files. The transcription rules are designed to be as objective as possible, to ensure consistency, while remaining simple enough to keep the transcripts readable. For details, please see the complete Transcription Manual. For a rough sketch of its principles and conventions, please browse the short summary below.

Transcription Manual

Click here to access the complete transcription guidelines used to transcribe the audio files of the corpus.

I would like to thank Manuela Schoenberger for reading and providing helpful comments on the transcription manual.

Overview of the Transcription Guidelines

1. Time Stamps and Tokenization

The corpus is completely time stamped. Every new unit has a time stamp that indicates where in the audio file the corresponding transcribed speech can be heard. It has the format [hours:minutes:seconds], e.g., [00:03:12] (for formatting and time stamps, see section A in the manual).
The corpus is also completely tokenized into sentence tokens. A sentence token is essentially an independent main clause with an overt subject and a finite verb, together with its dependents. Tokenization is one of the most difficult aspects of transcription. Many specific rules cover ambiguous cases, coordination, parenthetical clauses and direct speech, as well as exceptional tokens that do not strictly follow the general definition. An example of a series of sentence tokens introduced by time stamps is shown below, with the subject highlighted in yellow and the finite verb in green (for tokenization, see section B in the manual).

[00:00:00]		Welcome to this week's "Top Stock Picks".
[00:00:02]		I'm Tracey Ryniec.
[00:00:03]		And I'm joined at the chairs this week by Sheraz Mian.
[00:00:06]		And we have a couple of interesting stocks.
[00:00:09]		One is an old Dow component.
[00:00:11]		And the other one is a weight loss company, but not the one you might think.
[00:00:14]		So Sheraz, we're gonna start with you with the Dow component.
[00:00:18]		I'm kind of surprised you picked Caterpillar because I haven't been watching it but the last I looked, it was kind of down on its luck.
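
The time stamp format lends itself to automatic processing. The following is a minimal sketch, assuming each line of a corpus text file consists of a time stamp in the format [hours:minutes:seconds], some whitespace, and then the sentence token, as in the example above. The name parse_token_line is hypothetical and is not taken from the manual.

```python
import re

# Assumed line layout: "[HH:MM:SS]" + whitespace + sentence token.
STAMP_LINE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(.*\S)\s*$")


def parse_token_line(line):
    """Return (offset_in_seconds, text) for one time-stamped line, or None."""
    match = STAMP_LINE.match(line)
    if match is None:
        return None  # not a time-stamped sentence token
    hours, minutes, seconds = (int(group) for group in match.groups()[:3])
    return hours * 3600 + minutes * 60 + seconds, match.group(4)


sample = "[00:00:02]\t\tI'm Tracey Ryniec."
print(parse_token_line(sample))  # prints (2, "I'm Tracey Ryniec.")
```

The offset in seconds can then be used to locate the corresponding stretch of speech in the audio file.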

2. Fillers and Disfluencies

The common fillers usually rendered in writing as uhm, er, erm, uh, etc. are uniformly transcribed as ER (capital E, capital R, no punctuation). In the examples below, this common filler is shown in orange.

[00:00:09]		ER they're doing a lot of the shell ER drilling up in the Dakotas, which is the really hot area right now.
[00:00:15]		And they're seeing a lot of ER big finds up there.

Disfluencies of all kinds are indicated with three independent dots (...). Disfluencies are fragmentary syntactic units of any kind. They can be long or short, complex or simple, and include repetitions, corrections, accidents and false starts. The examples below show the disfluent material, along with the three dots, in purple.

[00:02:39]		But, you know, a ... their outlook is still very positive because they're keeping their ... their costs in other areas.
[00:02:44]		And I don't know if they have ever ... they're doing quite well right now.

Lexical fillers such as "you know" and "I mean" are set off by commas. The transcripts do not indicate pauses or non-linguistic noises like laughter. Mispronounced words are either treated as disfluencies or corrected without further indication (for fillers and disfluencies, see section C in the manual).
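
Because the filler and the disfluency marker are transcribed uniformly, they can be counted with simple pattern matching. The sketch below assumes that the filler always appears as the standalone word ER and that the three dots are always surrounded by whitespace, as in the examples above. The function name count_markers is hypothetical.

```python
import re

# The uniform filler: the standalone, upper case word "ER".
FILLER = re.compile(r"\bER\b")
# Three dots standing alone, i.e. not attached to a neighboring word.
DISFLUENCY = re.compile(r"(?<!\S)\.\.\.(?!\S)")


def count_markers(text):
    """Count ER fillers and disfluency markers in one sentence token."""
    return {
        "fillers": len(FILLER.findall(text)),
        "disfluencies": len(DISFLUENCY.findall(text)),
    }
```

Applied to the text of a sentence token (without its time stamp), the function returns both counts in a dictionary.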

3. Specific Conventions

The guidelines include a number of specific transcription rules. The transcripts use American spelling conventions, e.g., realize, NOT realise. All numerals are spelled out in full, e.g., two thousand twenty, NOT 2020. The inclusion and spelling of interjections are regulated by a comprehensive list, e.g., phew, whoa. Some contractions are indicated, e.g., it's, 'course; others are not, e.g., them, NOT 'em. Acronyms and letters are spelled with upper case letters and dots, e.g., U.S.A., the F. student (for specific transcription conventions, see section D in the manual).
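
Some of these conventions can be checked mechanically. The following is a rough heuristic sketch, not a procedure from the manual: it flags digits, the reduced form 'em, and runs of upper case letters written without dots. It assumes that the filler ER is the only legitimate undotted upper case word. The function name convention_issues is hypothetical.

```python
import re

DIGIT = re.compile(r"\d")                   # numerals must be spelled out
REDUCED_THEM = re.compile(r"(?<!\w)'em\b")  # "them" is written in full
UNDOTTED_CAPS = re.compile(r"\b[A-Z]{2,}\b")  # e.g. USA instead of U.S.A.


def convention_issues(text):
    """Return a list of likely convention violations in one sentence token."""
    issues = []
    if DIGIT.search(text):
        issues.append("numeral written with digits")
    if REDUCED_THEM.search(text):
        issues.append("'em instead of them")
    # Assumption: the filler ER is the only allowed undotted upper case word.
    caps = [word for word in UNDOTTED_CAPS.findall(text) if word != "ER"]
    if caps:
        issues.append("capital letters without dots: " + ", ".join(caps))
    return issues
```

An empty list means that none of the three heuristics fired. It does not show that the token follows every rule in section D.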

4. Capitalization and Punctuation

Capitalization and punctuation largely follow standard English orthography. However, there are also some differences. For example, direct speech is enclosed in single inverted commas. Direct speech is shown in red below.

[00:20:10]		And I said, 'Well, how much do you make?'.

Media titles, such as those of songs, books and video games, are enclosed in double inverted commas. In the sentence below, a book title is shown in blue.

[00:00:29]		If I were you, I'd write a book called "The Art of the Deal" because people are interested in deals.

Commas, full stops, question marks and hyphens are used, sometimes liberally, in the transcripts. Other punctuation marks, such as semicolons, exclamation marks or brackets, do not appear at all (for capitalization and punctuation, see sections E and F in the manual).
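
The restriction on punctuation can be expressed as a simple character check. The sketch below assumes that "brackets" covers round, square and curly brackets, and that the check is applied to the text of a sentence token only, since the time stamp itself is written in square brackets. The function name disallowed_punctuation is hypothetical.

```python
# Assumed set of punctuation marks that should not appear in a sentence token.
DISALLOWED = set(";!()[]{}")


def disallowed_punctuation(text):
    """Return the disallowed punctuation marks found in text, sorted."""
    return sorted({char for char in text if char in DISALLOWED})
```

An empty result means that the token contains none of the excluded marks.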

The Student-Transcribed Corpus
of Spoken American English

www.SpokenCorpus.org
