Natalie Moore's Blog
Check this page for weekly summaries of my individual work. My colleagues' pages are linked right below, as well as a link back to the team blog.
This week, I finished a listening exercise designed to mark emphasis and pauses in a recorded reading of a children's story. It's interesting how tricky it can be: after several listens, it's easy to overthink and insert marks where they don't belong. This exercise was really a preparation for the Tones and Break Indices (ToBI) system of annotation. I learned some ToBI on my own over the summer, but I reviewed the first few lessons, which introduce the H* (high) and L* (low) pitch accents and the L-L% and H-H% phrase-final tones. Pitch accents mark emphasis on words within a phrase, while the phrase accent and boundary tone together mark whether the phrase ends high or low. I also began editing the HTML of the blog to make it more visually appealing.
I continued my studies of ToBI, delving into the more complicated pitch accents and phrase accents. As more notation comes up, it becomes easier to trick my brain into hearing things that aren't there. It's useful to look at the waveform produced by the speech to help with analysis. It's not perfect, but looking at the shape of the graph can help me understand the tones in a way that's harder with ears alone.
I also worked to prepare some data files for analysis. These data files were collected by the CREU group last year, and they consist of various study participants reading dialogue from a comic strip, an example of which appears below. Each participant had a large sound file of them reading all of the comics, so these needed to be split into individual files for each comic.
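The splitting step above is straightforward to script. Here's a minimal sketch in Python using only the standard-library wave module; the cut points are placeholders, since in practice they would come from listening to each recording:

```python
# Sketch: split one long recording into per-comic clips.
# The (start, end) cut points in seconds are hypothetical; they
# would come from listening to the recording.
import wave

def split_wav(path, cut_points, out_prefix):
    """Write one clip per (start, end) pair in cut_points."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(cut_points, 1):
            src.setpos(int(start * rate))
            frames = src.readframes(int((end - start) * rate))
            with wave.open(f"{out_prefix}_{i:02d}.wav", "wb") as dst:
                dst.setparams(params)  # nframes is fixed up on close
                dst.writeframes(frames)
```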
This week, I finished reviewing the ToBI lessons. With the complete set of pitch accents, phrase accents, and break indices at my disposal, identifying prosody becomes harder, since distinguishing between similar accents can be tricky. Because of this, working in pairs or as a team can be useful. I also finished splitting the audio files that last year's team collected. The next step will be to start examining these recordings individually and applying what we've learned from ToBI to analyze the research participants' prosody. I'm excited for this next step.
We have moved on to actually beginning to annotate the data collected by last year's team! This is a lengthy and repetitive process, as all 36 recordings from each participant require annotation, and listening to many in a row can muddle the ears. We worked in a group for part of the process and arrived at mostly the same annotations, though there were slight differences. We will compare with the larger group and with our instructors at the next meeting. In the meantime, I suspect that the phrases "It's raining out there" and "It's out of order again" will be haunting my dreams after hearing variations on them over and over again.
This past week was quite a productive one. The team finished individually annotating the first participant's sound files from last year's research team. We agreed on most of the labels, which was surprising but encouraging. Michelle and I then teamed up to annotate the next participant's files, which went much less smoothly. It will be interesting to compare these annotations with the larger team to see what everyone thinks of them. Annotation can really wear me down after a long session; thirty-six sound files is a lot of annotating.
The team met to compare our results for the P18 sound files, the ones Michelle and I had struggled with. There was a lot of lively discussion, and we were ultimately able to come to an agreement. Our advisors assured us that the disagreement was not as bad as it could have been. I'm also looking into software to generate the TextGrids for the remaining participants, which Nanette had been taking care of up to this point.
We've continued annotating the sound files, moving on to P20. P20's recordings were troublesome; she said everything very quickly and did not use a lot of tonal variation. Discussing these with the wider team will be interesting. It was up to me to generate the TextGrids, the template used for annotation, this time. I found software that runs on the command line, but its results weren't quite as accurate as the grids we have been using so far. I will have to fine-tune the next batch. Moreover, I've been working on writing a Praat script to add tiers to the generated TextGrids. It's fun to do some programming, even if it's frustrating to work with a language that isn't very well-documented.
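Since a TextGrid is just a text file in Praat's documented "ooTextFile" layout, one way to generate templates is to write them directly. The sketch below builds an empty grid with one empty interval tier per name; the tier names and duration are placeholders, and anything generated this way is worth opening in Praat to validate:

```python
# Sketch: generate an empty Praat TextGrid (long text format) with
# one empty interval tier per name, so every sound file starts from
# the same annotation template. Tier names here are placeholders.
def make_textgrid(xmax, tier_names):
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {xmax}",
        "tiers? <exists>",
        f"size = {len(tier_names)}",
        "item []:",
    ]
    for i, name in enumerate(tier_names, 1):
        lines += [
            f"    item [{i}]:",
            '        class = "IntervalTier"',
            f'        name = "{name}"',
            "        xmin = 0",
            f"        xmax = {xmax}",
            "        intervals: size = 1",
            "        intervals [1]:",
            "            xmin = 0",
            f"            xmax = {xmax}",
            '            text = ""',
        ]
    return "\n".join(lines) + "\n"
```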
We continue to annotate, skipping back to P19. To be honest, I'm growing kind of bored with the process. Soon we'll have enough files annotated to begin our data analysis, which should hopefully be more interesting. I finished writing the Praat script, which was very satisfying, and also wrote some Python scripts to help expedite the process. Unfortunately there are some parts of the procedure that cannot be automated, but doing a bit of programming in between is still entertaining and a nice break from annotating.
This week was slightly different from the past few weeks. While we've been annotating at the rate of one speaker a week, frequently the hour-long meeting we have with the full team isn't long enough to discuss all our disagreements. This week, we paused new annotations in order to try to come to a consensus on the old ones. Professor Veilleux also briefly introduced a potential new method of annotation that would be partially automated, and she recruited me to write a Praat script to work on that. I will be using Praat's built-in functions to analyze the speaker's pitch at various points. I look forward to this challenge.
A second week without new annotations. Michelle and Eleanor worked together on their project of accumulating previous labels while I developed the new Praat script. It was a frustrating task, but I succeeded. It is fun to do some programming, even in a language like Praat. One of the things that really bothered me is that there are two methods of calling Praat's built-in functions. For instance, "Read from file: temp$" and "Read from file... 'temp$'" accomplish the same thing. There isn't really any explanation given for this that I could find. On the whole, Praat isn't a very well-documented language. Additionally, it starts counting at 1. What language starts at 1 instead of 0? Madness, I tell you.
My Praat script is complete. The goal of semi-automated ToBI is to analyze the pitch at certain points in the audio recording and to label the pitch from 1-5 depending on where in the speaker's range it falls. This system would take out some of the uncertainty and disagreement we've encountered in manual labeling. While some human interaction is still necessary for locating points of interest in the recordings, it wouldn't be necessary for labeling them, which would make the process much more efficient.
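The core labeling idea can be sketched outside of Praat. Under the assumption that the speaker's pitch floor and ceiling are known, a measured pitch maps to one of five bands by its position in that range (the function and its parameters here are illustrative, not the actual script):

```python
# Sketch of the semi-automated labeling idea: map a pitch value
# (Hz) to a band from 1 (bottom of the speaker's range) to 5 (top).
# The real work happens in a Praat script; this just illustrates
# the banding logic. The range bounds are placeholders.
def pitch_band(hz, lo, hi, bands=5):
    if hi <= lo:
        raise ValueError("invalid pitch range")
    # Clamp to the speaker's range, then find which fifth it falls in.
    frac = (min(max(hz, lo), hi) - lo) / (hi - lo)
    return min(int(frac * bands) + 1, bands)
```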
With the semester winding down, we met to discuss directions going forward. Michelle and Eleanor are working to analyze the data we've already collected. In the spring semester, we intend to run experiments to see whether subjects can consistently match various intonation patterns to the comics with which they are associated. Before beginning any experimentation, though, we need approval from Simmons' Institutional Review Board to ensure that we aren't doing anything unethical or harmful to our participants. We will put together the IRB proposal and submit it so we can run our experiments early in the spring.
For the last week before break, the team met for the last time this semester. I have been working on the Institutional Review Board proposal and intend to have it finished before I leave for break. The proposal describes what our research is about, what the experiment we intend to run will look like, who the research subjects will be, and what risks or benefits, if any, we anticipate for participants. In addition, I've reached out to Professor Byron Ahn of Princeton to request his documentation on semi-automated ToBI. We can then prepare our files for analysis using the Praat script I wrote right before Thanksgiving.
The team is on winter break. Happy holidays!
The team is on winter break. Happy holidays!
The team is on winter break. Happy New Year!
The team is on winter break. Happy holidays!
The team is enjoying their last few days of winter break. See you next week!
Since we hope to run experiments with human subjects this spring, we need to get all our documentation in order. I have completed the Institutional Review Board proposal and will submit it shortly; it must be approved before we can begin experimentation. I've also received the documentation for SToBI, so I need to look it over, familiarize myself with the notation, and see whether I need to make any changes to the script I wrote back in November to analyze the audio files. The hope is that this semi-automated method of annotation will be much more efficient than arguing about the correct annotation, as we spent much of last semester doing.
At this point, we have two main goals. The first is to design an experiment, which we hope to begin right after spring break. The second is to learn the SToBI labeling system and annotate more files with it, so that we can make use of the script I wrote last semester. Eleanor and Michelle have been examining the annotated data collected so far, looking for patterns and formulating a hypothesis. During our meeting this week, we first looked through SToBI's annotation guidelines. It is similar to ToBI, with the notable difference that annotations do not include a high (H) or low (L) marking. For a system designed to be more streamlined, it does not seem significantly simpler than ToBI.
After that, we discussed experimental design. We don't want to overwhelm the participants or force them toward a single answer. We concluded that asking participants to choose between three available comics seemed like a reasonable amount. Images from both our work on SToBI and the experiment can be found on the main blog page.
After reviewing and attempting some annotation with the SToBI system last week, Professor Veilleux encouraged us to come up with questions and points of confusion regarding the system. Next week we intend to have a video conference with Professor Byron Ahn of Princeton, a key member of the team behind SToBI. Hopefully his explanations will help resolve our confusion.
We also continued work on our experimental design by creating a spreadsheet that discusses the degrees of difference between the scenarios we are exploring depending on the kind of evidence the speaker and hearer possess. I would like to get the exact design nailed down, determining what each question in the experiment will look like, but Michelle wisely suggested that we wait until we have more data to analyze for patterns and to go from there. Honestly, I just don't enjoy thinking about annotating. This sort of subjective work is not my forte.
Lastly, I looked into the institutional requirements we need to fulfill in order to run our experiment. Aside from the completed IRB proposal, we also need to complete an online course called the Collaborative Institutional Training Initiative, or CITI. I suspect that this program will cover the guidelines of ethics and responsibilities when experimenting on human subjects.
This week I worked through the complete CITI online course. The material covered included the Nuremberg Code and the Belmont Report. These documents help guide the ethics of research on human subjects. The courses also discussed the function and purpose of the Institutional Review Board. While I don't think that our research will involve anything more dangerous than perhaps eye strain from looking at a screen, it's definitely valuable information to know and understand. Having taken the course, I'll send Michelle and Eleanor the link to the CITI website so that they can complete the courses as well.
During our weekly meeting, we then discussed SToBI. Given the limited practicality of learning an entirely new system of annotation, Professor Veilleux and her fellow researcher Professor Ahn floated the idea of not labeling the rest of the audio files at all. This is good news for me, because I strongly dislike annotation. Instead, we will run our experiment with the files as-is. We need to listen through the audio files we have collected to see which speakers have particularly clear, understandable voices, and we will use those recordings in the experiment. In addition to running a local experiment here at Simmons, we also decided to use a website called Amazon Mechanical Turk. This will allow us to recruit many more participants online than we might otherwise be able to reach.
With spring break approaching and Professor Veilleux hoping to begin our experiment shortly afterward, we need to finish all our preparations. Michelle and Eleanor need to complete the CITI course, so I sent them the link for that. Moreover, we need to make sure the Institutional Review Board proposal is complete and ready for submission along with its various accompanying parts, such as the informed consent form. However, we can't complete these until we have our experimental design completely laid out. To advance this objective, I have been looking at the Amazon Mechanical Turk website to develop an understanding of its functionality. Professor Ahn also recommended a resource called TurkTools, which is intended to make developing experiments over Mechanical Turk easier. I will further investigate TurkTools and start building our experimental materials.
Next week is spring break! Since we hope to begin experimentation soon afterward, we need to ensure that all our materials are gathered. To this end, I worked on doctoring some of the sound files and comics. This is necessary because in the recordings and original comics, there were slight discrepancies between scenarios. A person might say "Huh, so it's raining out there" and "Looks like it's raining out there" in separate instances. If we want participants to match a sound file to the correct comic, little things like this would be clear giveaways. I have shortened the comics and sound files to their common denominator, either "...it's raining out there..." or "...it's out of order again..." I have included an example of a comic before and after underneath. Aside from this, we also need to make our experiment presentable. TurkTools provides various HTML skeletons and Python scripts for populating them with the experimenter's own relevant data, so I may use these or work on writing my own HTML.
It's spring break! See you next week.
Fresh off of relaxing over spring break, we were ready to meet up and pull our experiment together so we could begin preparing a pilot run. However, the snowstorm that hit Boston complicated matters by resulting in two snow days. Despite the foul weather, Michelle and I met on Pi Day to discuss the experiment. For any given question, we want participants to have three comics to choose from to match with the sound file. However, the question of which two comics to use, aside from the correct one, remains disputed. Michelle and I decided that it would be best to make an explicit listing for each question, writing down the right answer and the two distractors. Aside from this, I have been working with the TurkTools programs and provided templates to make our experimental materials for use on Amazon Mechanical Turk.
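The per-question listing could be kept as a simple data structure, one entry per question. A sketch (all file names here are hypothetical):

```python
# Sketch: one explicit entry per experimental question, naming the
# sound file, the matching comic, and the two distractor comics.
# Every file name below is hypothetical.
questions = [
    {
        "sound": "p20_raining_03.wav",
        "correct": "raining_comic_03.png",
        "distractors": ["raining_comic_01.png", "raining_comic_05.png"],
    },
    # ... one entry per question in the experiment
]
```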
The HTML for the experiment is mostly complete! I just need to add the consent form and figure out how to host the code on Mechanical Turk. Michelle and I also discussed our concerns about the chosen speaker with Professor Veilleux. We listened to the other speakers to see if we could find one with more variation in her prosody, and we found a few candidates. Hopefully Professor Veilleux will agree with our selection and we can get back on track; obviously we can't put the experiment on Mechanical Turk until we have the chosen speaker and the associated sound files. I also did some reading about Mechanical Turk and found various sources expressing concern about the low rates of pay and the lack of recourse for workers, the people who will be taking our survey, to voice complaints. I want to make sure we include contact information in our experiment and pay a decent wage.
Below is a screenshot of what the survey looks like:
With the project code finally completed and Professor Veilleux's blessing, I posted the experiment to Mechanical Turk on April 10th. Due to Mechanical Turk's fees, we were aiming for about 140 participants. However, a participant emailed us the next day to let us know that there was a bug in the code that allowed someone taking the survey to check more than one radio button. In essence, the roughly 50 results we'd already gotten were functionally useless. I immediately canceled the batch and tracked down the error pretty quickly. As I wrote previously, only HTML input elements with the name attribute set are returned in the MTurk results spreadsheet, and the radio buttons themselves weren't where I was storing the input from each question. Because of this, I had carelessly deleted the name attribute of the radio buttons, thinking it irrelevant. As it turns out, though, the name attribute is what connects sets of radio buttons into a group in HTML and allows only one of them to be checked. My carelessness and lack of testing cost us precious time and resources; I suppose this is a valuable lesson in the importance of testing, double-checking, and having others review one's work. I fixed the bug and posted the experiment once more. Hopefully this won't be too damaging a setback.
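The grouping behavior can be illustrated with a small snippet (the name value "q1" and the comic values are hypothetical):

```html
<!-- Buggy: without a name attribute, the buttons are independent,
     so a participant can check more than one at once. -->
<input type="radio" value="comic_a">
<input type="radio" value="comic_b">

<!-- Fixed: a shared name groups the buttons, so checking one
     unchecks the others, and the value is returned in the results. -->
<input type="radio" name="q1" value="comic_a">
<input type="radio" name="q1" value="comic_b">
```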
This Tuesday was Simmons' Undergraduate Symposium, where undergraduate students have the chance to present their research and projects to their peers. Michelle, Eleanor, and I presented our research as part of the poster session. Because of how late we posted our experiment on Mechanical Turk, we didn't receive our results before we had to send our poster design in to be printed. We left the "Results" section blank, planning to print out our results and paste them onto the poster at the Symposium. On Monday night, we collected the data from Mechanical Turk. I used R to do some initial filtering on the data, such as eliminating results from people who indicated that English was not their first language. Then we looked at the broad data for patterns. Given the time crunch, we only looked at the percentage of people who selected the correct answer. This ended up being about 34%; given that there were three potential answers, that amounts to little more than random guessing.
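The filtering itself was done in R; for illustration, the same first pass could look like this in Python. The field names ("native_english", "correct") are assumptions about the results spreadsheet's headers, not its actual columns:

```python
# Sketch of the first-pass cleanup: drop rows from participants who
# said English was not their first language, then compute the overall
# share of correct answers. Field names are assumptions.
def clean_and_score(rows):
    kept = [r for r in rows if r["native_english"] == "yes"]
    accuracy = sum(r["correct"] for r in kept) / len(kept)
    return kept, accuracy
```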
This was a bit of a demoralizing blow for me. I felt as though I must have made a mistake in the code that resulted in the experiment's failure. However, Michelle and Professor Veilleux pointed out that sometimes research just doesn't work out and that there were other variables at play. In particular, people on Mechanical Turk could be a biased sample, because they are probably taking many surveys and not necessarily giving each task their full attention. Moreover, trying to replicate the nuances of speech and conversation in a laboratory setting is difficult. Eleanor also suggested that we look at individual participants to see whether any of them had higher success rates. Perhaps all hope is not lost.
I used our results spreadsheet to get more information than just the average number of correct answers, to see if any participants did better or worse. I'd never tried this sort of analysis in Excel/Google Sheets before, and it was my first time writing spreadsheet formulas. It turned out to be quite easy and intuitive; Google Sheets even has built-in functions for working with regular expressions, which I used, and the syntax will be familiar to anyone who has programmed before. With this analysis, I determined each participant's percentage of correct answers. There was a lot of variance, with some people scoring as low as 17% and others as high as 50%, but nobody much higher. I then did a participant-by-participant analysis for just the copier questions ("It's out of order again") and the raining questions ("Seems like it's raining out there"). I sort of expected that raining would have a higher success rate, but that wasn't true across the board. In fact, the highest percentage I saw was a participant who got 71% of the copier questions "correct"; that is, their chosen comic corresponded with the recording. Now I'm going to hand the data off to Eleanor, our resident data scientist, to do more work with it.
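The per-participant breakdown can be sketched in Python, with re.search playing the role that Google Sheets' regular-expression functions played in my spreadsheet. The field names, and the assumption that each question id contains "copier" or "raining", are my own illustrative conventions:

```python
# Sketch of the per-participant accuracy breakdown. Each answer row
# is assumed to carry a participant id, a question id containing
# "copier" or "raining", and a 0/1 correctness flag; these field
# names are assumptions, not the spreadsheet's real headers.
import re
from collections import defaultdict

def per_participant_rates(rows, pattern=""):
    """Fraction correct per participant; optionally filter questions
    whose ids match a regular expression (e.g. "copier")."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        if pattern and not re.search(pattern, r["question"]):
            continue
        totals[r["participant"]] += 1
        hits[r["participant"]] += r["correct"]
    return {p: hits[p] / totals[p] for p in totals}
```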