Michelle Medici's Blog
Check this page for weekly summaries of my individual work. My colleagues' pages are linked right below, as well as a link back to the team blog.
This week I read through the ToBI textbook and worked on the exercises. I am now more familiar with the vocabulary that we will be using throughout the year that's related to ToBI. I also have become more aware of how the Praat software works. I have had a lot of trouble figuring out how Praat works, so I will continue to play with Praat. I also completed the rapid transcription exercise. This has helped me get a better idea of what we will be looking for when we use ToBI.
This week I continued to work through some ToBI tutorials. I learned more about how labeling in Praat works and I have gotten better at figuring out when speakers are going from a low to high pitch or the other way around, which is something I had been struggling with. It helps for me to be looking at the pitch contours so I can see while I listen. I also found that if I repeat what the person says the same way they say it, I can feel my own voice going up or down, so that helps a lot.
This week we had a meeting where we heard the same phrase with all of the different tonal patterns. This was really helpful, because I have been hearing the different patterns on different phrases, so it was hard to compare them. I also found that there can be some debate on how to label certain words. There’s not always a clear answer, so it will be helpful to check with the other researchers to get their opinion.
This week I started labeling a recording that was collected by last year’s group. I met with Natalie and Eleanor to compare our labels, so we could make sure that we were all on the same page. We disagreed on a few sounds, but for the most part we agreed on labels. This was really helpful, because unlike the tutorials, our recordings obviously don’t have answers for us to check. We will continue to compare our labels to be sure that we are all on the same page.
This week I met with Natalie and we labeled a new file together. I also got everyone’s labels from the past file that we hadn’t compared yet. I compared all of the labels and created a new file with all of the labels that we came up with. For each label I used the answer that most people had. If people had other answers I put those in the comment row.
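The majority-vote step described above can be sketched in a few lines of Python. This is only an illustration: the function name `consensus`, the list-of-strings input, and the tie-breaking rule (the first label seen wins) are assumptions, not the actual procedure used for the files.

```python
from collections import Counter

def consensus(labels):
    """Return (majority_label, other_labels) for one labeled position.

    labels: one label string per labeler (hypothetical input format).
    Ties go to the tied label that appears first, which is an assumption.
    """
    counts = Counter(labels)
    winner, _ = counts.most_common(1)[0]
    # Minority answers would go in the comment row.
    others = sorted(set(labels) - {winner})
    return winner, others

# Hypothetical labels from three labelers.
print(consensus(["H*", "H*", "L*"]))        # ('H*', ['L*'])
print(consensus(["L-L%", "L-L%", "L-L%"]))  # ('L-L%', [])
```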
This week I collected everyone’s labels for a new file. I compared the labels and created a new folder with the consensus labels. There were some labels that we disagreed on a lot, so I put those aside. We are going to talk about those labels together in our meeting, so we can decide on one label.
This week at our meeting we talked about the files that we disagreed on from last week. We each took turns explaining why we put the labels we put to help us come to an agreement. Natalie and I met to label a new file. This one went much more smoothly for us than last week’s! We had very little disagreement. Again, I collected the labels from everyone and created a new folder with all of the consensus labels. I put aside the ones that we disagreed on, so we can discuss them during our meeting.
This week we labeled a new file and I collected all of the labels from everyone and compared them. I wrote down the labels that we disagreed on so that we can talk about them in our meeting. At our last meeting we talked about the labels we disagreed on from an earlier file, and came to a consensus on all of them. This week we will start to do an analysis on the files that we have labeled as well as continue to label new files. As we analyze we will also add in new files when we finish labeling them.
This week we finalized another file of recordings together at a group meeting. So now we have two files that are completely done. I sent the labels that we disagreed on from the other two files to everyone. They reviewed them, and checked if they could change their minds on any labels after seeing what everyone else said. Then they all sent me those back. I took a look at those, and we cut down on the number of disagreements by a lot, so now we will have time to work on starting the analysis during our next meeting.
This week Eleanor and I met up to talk about how we want to display the consensus labels so that we can compare all of the files. We are going to have a separate spreadsheet for each recording, so that we can easily compare all of the different speakers. We started filling those out, and we will continue to add to them as we continue to label more files.
While I was trying to fill in more Excel sheets with labels for analysis, I ran into a small problem on several files. Some files that have the same name do not have the same recordings, so they don’t match up correctly. I completed the files that did match up. I will talk to my group tomorrow about the files that don’t match up, so we can figure out where the mistake is.
This week I worked on more spreadsheets. These spreadsheets don’t show every label in each file. Instead, these spreadsheets focus on the target word in each recording. For example, in recordings that say “it’s raining out there,” raining is the target word. Eleanor and I met up a few times to talk about how we would like to display this data. We came up with a few options and we plan to discuss them in our meeting tomorrow.
This week we met for the last time. We talked about what our plans were for before break and during break. We are putting together final spreadsheets of consensus labels. During the break we are going to look over the consensus labels and try to make a hypothesis about when people use emphasis and whether that emphasis is high or low. We will eventually make a hypothesis all together and test it out next semester.
We are on winter break! Happy holidays!
We are on winter break! Happy holidays!
We are on winter break! Happy new year!
We are on winter break!
We are on winter break.
This week we met for the first time after winter break. We talked about how we plan to make a hypothesis using the data from the spreadsheets we made. This hypothesis will be our guess about when people use emphasis on the target word in each statement. For example, we will look at all the statements where the speaker has direct evidence of what they’re talking about and the hearer has no evidence. We will be looking for any patterns that stand out, such as participants tending to use a high emphasis on the target word. Eleanor and I met up to try to analyze the data. We found that because we have so little data so far it was very difficult to find any strong patterns. For this analysis we were only using the spreadsheet that displayed data for the “it’s out of order again” recordings. We plan to meet up again before next week to look at the “it’s raining out there” data, which we think will show us clearer patterns, because in these recordings raining is definitely the target word, whereas in the other recordings, out and order could both be target words. What we did find while looking at the data is that it seems like when a speaker has direct evidence of what they’re saying, then they typically emphasize the target word with a high pitch.
This week I started analyzing the raining recordings. I noticed that when a speaker has direct evidence of what they are saying, they use a high emphasis on the target word, while the emphasis for speakers who have indirect evidence is more scattered. This is similar to what Eleanor and I saw in the out of order recordings. With the raining recordings I also noticed that when the hearer has no evidence of what is being said, the speaker is more likely to use a high emphasis on the target word than if the hearer has any type of evidence. In our meeting we decided that we will look at the raining recordings to find the patterns and then we will look at the order recordings to see if the patterns show up there too. Natalie, Eleanor and I met up to talk about a new system for labeling our recordings that will hopefully be faster. This new system is called SToBI, and it’s very similar to ToBI. The biggest difference is that we won’t be noting whether something is high or low. Instead we will just note that there is an emphasis. I think this will be much better for us, because we would spend a lot of time trying to figure out if something was high or low.
This week we met to talk about the questions we still have about SToBI. We are still confused about the “ - “ label and the “ ’ ” label. For me, this system isn’t much easier than ToBI, because I was almost always very positive about whether something was high or low. It was determining if there was actually an emphasis that gave me the most trouble. We also met up to talk more about how we plan to design our study. We created a matrix to show what combinations of speaker evidence and hearer evidence we should put together on one question. We marked the degree of difference between each combination from 0 to 3. For example, S1H0 has a degree of 0 difference from S1H0, which means they are the same. S1H0 has a degree of 1 from S1H1 or S2H0. This means that there is only one label that’s different from S1H0, and it’s only off by 1. This means that the combination is only slightly off. S1H0 has a difference of degree 2 from S1H2 and S2H1. S1H2 has one label that’s off by 2 from S1H0. S2H1 has two labels that are off by 1. This means that the combinations are completely off. S2H2 has a difference of degree 3 from S1H0, because one label is off by 1 and one label is off by 2. This means that the combination is off by even more than a degree of 2. In our experiment, for each combination of labels, we plan to present a 0, a 1, and a 3. For some combinations, there are no 3’s. We will use 2’s in these cases.
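Going by the examples above, the degree of difference looks like the speaker difference plus the hearer difference. That reading is an assumption, since the post only gives examples, but a small Python sketch reproduces every value listed for S1H0. The helper names `parse` and `degree` are made up for illustration.

```python
def parse(code):
    """Split a code like 'S1H0' into (speaker, hearer) integers."""
    return int(code[1]), int(code[3])

def degree(a, b):
    """Assumed rule: speaker difference plus hearer difference."""
    s_a, h_a = parse(a)
    s_b, h_b = parse(b)
    return abs(s_a - s_b) + abs(h_a - h_b)

# The six combinations: speaker evidence 1-2, hearer evidence 0-2.
CODES = [f"S{s}H{h}" for s in (1, 2) for h in (0, 1, 2)]

# Print the full matrix, one row per combination.
for code in CODES:
    print(code, [degree(code, other) for other in CODES])
```

The first row printed is for S1H0 and matches the text: 0 for itself, 1 for S1H1 and S2H0, 2 for S1H2 and S2H1, and 3 for S2H2.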
This week we decided that we are going to take a step back from SToBI, and try to design an experiment based off of the data that we already have labeled. We will use that data to make our hypothesis for the design. Unfortunately, I found a small error in one of the files that we already examined. It turns out that some of the textgrids from one file were named incorrectly, so I had to redo all of the spreadsheets, but thankfully the data still showed similar results. We decided we will choose several speakers, 4 to 6, who spoke very clearly to be the speakers we use in our experiment. I've picked out the speakers who I thought were the clearest speakers and I plan to compare with the speakers that Natalie and Eleanor picked. For our experiment we are going to have two variables: how much evidence the speaker has and how much evidence the hearer has. I researched Latin Squares a little bit, because we plan to use them as a way to help set up our design. After drawing out a Latin Square with our variables, I am not sure if I used it correctly, and if I did use it correctly, I'm not sure how to interpret it. I created a 6X6 table with S1H0-S2H2 as the variables for the rows and columns. I plan to ask Nanette for guidance on this.
This week I added labels from another speaker to our spreadsheets. This data matches up well with the data that we already had. In our meeting we talked more about how we plan to set up our experiment. We picked out one speaker from our recordings. This speaker speaks very clearly, which will make it easier for participants to hear how she is pronouncing everything. I went through some of her recordings to make sure the labels on her recordings match up with the patterns that we saw in the spreadsheets. I didn't think that this speaker shows what we see in the spreadsheets very well. This speaker does almost the same exact thing no matter how much evidence she or the hearer has, which is not what we saw in the spreadsheets. She almost always has an H* on the target word in every recording, but we were seeing that most of the speakers use more emphasis when the speaker has direct evidence, and less when they have indirect evidence. I am going to meet with Eleanor to see if she agrees or if she thinks it is a good match. We talked about the Latin Squares for setting up our experiment. Nanette explained that we should use a 3X3 table with almost right, right, and wrong as the rows and columns. Right would be the correct answer, almost right would be similar to a difference of degree 1 on the matrix that we created a few weeks ago, and wrong is a 2 or 3. The Latin Square just gives a visual of what answers we should show for one question. We also talked about planning a pilot study for the week right after our spring break, which is coming up soon. We are planning to set up two. In one, there will be one picture and two recordings. The participant will have to choose the correct recording. The other one will be what Natalie, Eleanor, and I have been thinking about, where there's one recording, three pictures, and the participant must choose the correct picture.
This week we talked about the speaker who we picked out to be used in our experiment. We decided that it was a good thing that she always has some kind of emphasis on the target word of every recording, because the point of having a set target word in these phrases is that the word should have an emphasis on it. However, I am still sort of skeptical of this speaker, because I don't see any differences between her saying something with direct evidence or indirect evidence, or any difference between her saying something to someone with no, indirect, or direct evidence. Nanette mentioned that maybe because of this, we need to use another labeling system, which measures the pitch on a scale, rather than just high or low. We could also pick another speaker, who matches our data. I also looked into figuring out exactly what pictures will be shown with which recordings. We omitted the experiment where we have two recordings and one picture. We are definitely doing the experiment where we give one recording, and the participant picks one of three pictures. We also decided to not show pictures that are three degrees different from the correct picture, because we think those answers are very obviously wrong, and some pictures don't have any options that are three degrees away. So for an S1H0 we will choose from S1H1 and S2H0 as close, but wrong, and we will choose from S2H1 and S1H2 for wrong. I also took a course from the Collaborative Institutional Training Initiative, so that we can get authorized to do our experiment on human subjects.
It's spring break! See you next week.
Happy Pi Day! To celebrate, Natalie and I ate pizza while we talked about final decisions on what pictures to show for each question. Here are the options for each combination of speaker and hearer evidence:
S1H0 Close: S1H1, S2H0 Far: S2H1, S1H2
S1H1 Close: S1H0, S1H2, S2H1 Far: S2H0, S2H2
S1H2 Close: S1H1, S2H2 Far: S2H1, S1H0
S2H0 Close: S2H1, S1H0 Far: S2H2, S1H1
S2H1 Close: S2H0, S2H2, S1H1 Far: S1H0, S1H2
S2H2 Close: S2H1, S1H2 Far: S1H1, S2H0
Each question will have one from its close list and one from its far list. Natalie and I talked about the idea of manually picking out exactly what pictures will show up, so that we could have a round focusing on speaker evidence and a round focusing on hearer evidence. We decided that we would set our experiment up so that it randomly picks out one from each list. We figured that we will still be able to see if people are better at identifying speaker evidence or hearer evidence, or if people are good or bad at both. If we find that participants do not tend to get the exact answer, we will see if they tend to identify the correct speaker or hearer evidence. For example, on S1H1, we might find that most people are picking either the S1H1 picture or the S1H0 picture, but not the S2H0. If we see this trend, then we might assume that hearer evidence doesn't have as much of an effect on how a speaker says something.
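The close and far lists above line up with a difference of degree 1 and degree 2, if the degree is taken to be the speaker difference plus the hearer difference. Under that assumption, here is a rough Python sketch that rebuilds the lists and randomly picks one option from each. The function names and the use of `random.choice` are illustrative only and are not the actual experiment code.

```python
import random

# The six combinations: speaker evidence 1-2, hearer evidence 0-2.
CODES = [f"S{s}H{h}" for s in (1, 2) for h in (0, 1, 2)]

def degree(a, b):
    """Assumed rule: speaker difference plus hearer difference."""
    return abs(int(a[1]) - int(b[1])) + abs(int(a[3]) - int(b[3]))

def option_lists(target):
    """Return (close, far): codes at degree 1 and degree 2 from target."""
    close = [c for c in CODES if degree(target, c) == 1]
    far = [c for c in CODES if degree(target, c) == 2]
    return close, far

def pick_question(target, rng=random):
    """Pick the correct code plus one random close and one random far option."""
    close, far = option_lists(target)
    return target, rng.choice(close), rng.choice(far)

for code in CODES:
    close, far = option_lists(code)
    print(code, "Close:", close, "Far:", far)

print(pick_question("S1H1"))  # output varies from run to run
```

Options at degree 3 never appear in either list, which fits the decision to leave out the answers that are three degrees away.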
Because of the storm this week, we weren't able to meet up, so I still plan to confirm that we should stick with the speaker we've picked, or decide that we should switch.
This week we met to talk about the experiment. We actually decided we wouldn't make the combinations of answers random. We are going to test all combinations of correct answer, one close answer and one far answer. Because this would be a lot of questions for one person to answer, we are going to have three sets of questions. That way participants won't start to get bored and give bad data. Natalie and I talked about the speaker who we picked out to use in the experiment. Natalie tried a few questions, and she found that it was really hard. She said that she had a hard time telling the difference between any of the sound files, because it sounds like the speaker is doing the same thing every time. This is what I saw in the spreadsheets. We are going to show Nanette this at our next meeting. I also went through some more audio files to chop them up, so they can be used for our research.
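To get a feel for how many questions "all combinations" means, here is a short Python sketch that enumerates every (correct, close, far) triple from the close and far lists posted on Pi Day. The count is for one phrase only, and the round-robin split into three sets is purely an assumption, since the post does not say how the sets were divided.

```python
from itertools import product

# Close and far lists copied from the Pi Day entry.
OPTIONS = {
    "S1H0": (["S1H1", "S2H0"], ["S2H1", "S1H2"]),
    "S1H1": (["S1H0", "S1H2", "S2H1"], ["S2H0", "S2H2"]),
    "S1H2": (["S1H1", "S2H2"], ["S2H1", "S1H0"]),
    "S2H0": (["S2H1", "S1H0"], ["S2H2", "S1H1"]),
    "S2H1": (["S2H0", "S2H2", "S1H1"], ["S1H0", "S1H2"]),
    "S2H2": (["S2H1", "S1H2"], ["S1H1", "S2H0"]),
}

# Every combination of correct answer, one close answer, and one far answer.
questions = [
    (correct, close, far)
    for correct, (close_list, far_list) in OPTIONS.items()
    for close, far in product(close_list, far_list)
]
print(len(questions))  # 28

# Hypothetical split: deal the questions into three sets round-robin.
sets = [questions[i::3] for i in range(3)]
print([len(s) for s in sets])  # [10, 9, 9]
```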
This week we talked to Nanette about how we thought that the speaker we had picked out wasn't a good fit. She told us to find a new speaker, whose recordings actually showed similar patterns to what our data showed. For example, we looked for speakers who used more emphasis on lines where they had more evidence. Natalie and I listened to all the speakers and narrowed it down to two speakers who both spoke clearly and matched the patterns we were looking for. We also noticed that some of the chopped files we have for speakers were named incorrectly. For example, an S1H0 might have been named S2H1, so we plan to fix those files, for future groups that use these recordings. I also went through the files that I chopped last week and named them, so they tell you what kind of phrase it is, like S1H0 or S2H2.
We got approved by the Institutional Review Board to run our experiment, so as soon as the experiment is ready to run, we can start running it! We all agreed on a new speaker for our experiment, because she has noticeably different prosody for different situations. I chopped her recordings, so that they all say the same thing. The original recordings include extra words that might lead participants to the correct answer, instead of the prosody. For example, one of the lines is "seems like it's raining out there." A participant might use the "seems like" part of the sentence to make their guess. I made them all say "it's raining out there." or "it's out of order again" and nothing else.
This week we got our experiment up on Mechanical Turk! We are hoping to get about 150 participants and we will pay them 25 cents each for taking our test. We hope to see that participants' answers confirm our hypothesis that speakers use more emphasis when they have direct evidence and slightly more emphasis when the hearer has no evidence. If not, we will have to create a new hypothesis about when speakers use more or less emphasis, which will probably mean that we have to label more recordings. I also continued to work on our poster for the Simmons College Undergraduate Symposium.
This week we completed our poster for the symposium, except for the results section, because we don't have any results yet. We are putting the experiment on Mechanical Turk again, because we aren't able to use the data from the last round. It turns out that there was a small bug, which caused all of the data to be meaningless. Luckily only about 60 people had taken it by the time that Natalie figured it out, so we didn't lose too much money. We plan to analyze the data once we have about 150 participants from this round. I'm really looking forward to seeing how people do!
This week we got results from our experiment! The way that they were displayed in the spreadsheet made them hard to analyze, so I wrote some Python scripts to clean up the data a bit. I was very excited about this, because I used regular expressions. I've always had trouble with regular expressions, but this time I really understood what I was doing and it worked! Once I had better data to work with I wanted to get a quick look at how people did overall. I wrote a Python script to see how many answers were correct on each question. It turned out that only 36% of them were correct. This is concerning, especially because there were only three options for each question, so 36% correct is about what you would expect if everyone were guessing on every question. Then I tried to see if participants were better at guessing just the speaker's evidence. This time the results showed that they were 62% correct. At first I thought that was great, but then I realized that every question had three options, and two of the options had the correct speaker's evidence, so 2/3 of the answers would work. Again 62% is about what you would expect if everyone were guessing. Lastly I tried looking at whether people could correctly pick the hearer's evidence. The results showed that people correctly picked hearer evidence 49% of the time. On half of the questions the hearer evidence was correct for 2 out of 3 answers and for the other half of the questions the hearer evidence was correct for only 1 out of 3 answers. So 49% is also what you would expect if everyone were guessing. We talked about how it could be that the experiment is just too hard, there could be some mistakes with the code for the experiment, there could be errors in my code where I was cleaning the data or analyzing it, there could be some bad participants that we need to filter out, or we need to analyze the data further.
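The chance levels in that comparison can be worked out exactly. Here is a small Python sketch that uses only the proportions stated above (three options per question, two of three options sharing the correct speaker evidence, and the half-and-half split for hearer evidence). The variable names are made up for illustration, and this is not the analysis script itself.

```python
from fractions import Fraction

# Chance of picking the exact answer out of three options.
exact = Fraction(1, 3)

# Two of the three options carry the correct speaker evidence.
speaker = Fraction(2, 3)

# Half the questions have 2 of 3 options with the correct hearer
# evidence, and the other half have 1 of 3.
hearer = Fraction(1, 2) * Fraction(2, 3) + Fraction(1, 2) * Fraction(1, 3)

chance = {"exact": exact, "speaker": speaker, "hearer": hearer}
observed = {"exact": 0.36, "speaker": 0.62, "hearer": 0.49}

for name in observed:
    print(f"{name}: observed {observed[name]:.0%}, chance {float(chance[name]):.0%}")
# exact: observed 36%, chance 33%
# speaker: observed 62%, chance 67%
# hearer: observed 49%, chance 50%
```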
We saw that some participants did very badly, whereas some did much better, so some people might be throwing off our data. One problem with Mechanical Turk is that some people are taking a lot of tests, so they get bored and just start clicking A for every question. We plan to get rid of data from participants who seem to have done that. Also, we want to look at specific questions, to see if maybe there were some questions that were easier or harder to answer.
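One simple way to flag participants who clicked the same answer on every question is sketched below in Python. The participant IDs, the answer letters, and the rule itself (every response identical) are assumptions for illustration. A real filter might need to be more forgiving, for example by also catching people who picked the same answer almost every time.

```python
def is_straight_liner(answers):
    """True if a participant gave the same response to every question."""
    return len(answers) > 1 and len(set(answers)) == 1

# Hypothetical responses keyed by participant ID.
responses = {
    "p1": ["A", "C", "B", "A"],
    "p2": ["A", "A", "A", "A"],
}

# Keep only participants who varied their answers.
kept = {pid: ans for pid, ans in responses.items() if not is_straight_liner(ans)}
print(sorted(kept))  # ['p1']
```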
This week we started taking a closer look at the results. We looked into how individuals did, but even then we didn't see that anyone was doing all that well. No one got much higher than 50%, and some did very poorly - under 20%. Eleanor and I looked through recordings that we labeled during the first semester to try to find some that were labeled exactly the same, for example with an H* on the target word and an L-L% boundary tone, but had different contexts, like one being an S1H1 and the other an S2H2. These can be useful in helping Nanette on some future work. We talked about whether or not we actually believe the theory that speaker and hearer evidence affect speaker commitment. We all said yes, because when we listen to the recordings, we hear it. We talked about how maybe participants didn't do that well because it wasn't explained well enough and it was their first time trying this. Even we had trouble at the beginning. Eleanor and I suggested that when Nanette improves ToBI, she include some kind of scale for rating an H* and an L*. Many times we found that some H*'s were very prominent and some were a little weaker. One problem with this ranking system would be that it is hard when you first start hearing a person speak to decide what their normal level is. What I mean is that if I meet someone for the first time, and they're excited, I might assume their current pitch is their normal pitch. Computers would have an especially hard time figuring out where a person's normal pitch lies.