Online Survey Testing Music Perception

Hi everyone! Aside from being a casual composer and long-time lurker here, I am a PhD student at Yale University's Department of Computer Science and have been doing research in computer music. Within that field, it is unfortunately very common for musical models to be proposed but never formally tested; in fact, there is no consensus on what procedures or metrics to use in some cases. In an attempt to bridge that gap, I have been developing a participant study to test various hypotheses about algorithm performance and music perception. I am currently running a pilot version of the experiment to test the robustness of the system (lag, audio playback problems, etc.) on various platforms and internet connections. If you would like to participate, the pilot study is here:

http://www.donyaquick.com/php/public_experiment.html

If you decide to participate, you will be asked to listen to short clips of music and guess whether each clip was created by a human composer or a computer algorithm. You will also be asked questions about your musical background (e.g., how many years of musical training you have had). In all, the study should take no more than 20 minutes to complete.

If there are any questions, I am happy to discuss the nature of my work in general terms here, although obviously I can't give away certain details about this particular experiment until all data from the final study have been collected. Also, the examples are randomized, so there is no way to do a direct comparison of answers from one run to another.

If you experience any technical problems, please let me know the conditions under which they occurred, such as what browser/device you were using at the time. Some browsers (particularly older versions) do not support the parts of HTML5 that the study uses to deliver audio. The study has also not been tested on any mobile devices, so I have no idea whether it will display/perform correctly on tablets, phones, etc.


Replies


  • Yes, yes, you understand. A shame, I could never play like this with my vooden arm.


    Fredrick zinos said:

    Johann,

     Your confusion about Clinton’s playing is possibly due to the fact that it isn’t an algorithm, but an Al Gore Rhythm.

  • Fredrick: in participant studies or psychology experiments, exact information on the planned analysis (basically the methods section of whatever paper is written) is rarely given publicly until publication of the results. With online survey-style experiments, it is also quite common for a single person to administer the experiment itself and perform the initial analysis, since there is simply no need for a team of people to be involved when the study and initial analysis can be largely automated.

    Roger: generally yes. In that regard, your chess analogy is a good one. However, the chess algorithm actually has the advantage when it comes to evaluation of its performance because of how well-defined the problem is – we know when a chess game is won or lost with rather a lot of certainty, which is not the case with many musical problems. The matter of whether a human playing chess against an opponent thinks he/she is engaged with a human or a machine would be a classic Turing test. And yes, there is a modified and somewhat more complex version of that type of testing involved.

    Bob: I will try my best to address all the things you listed.

    1 " it is unfortunately very common for musical models to be proposed but never formally tested" What kind of musical models for what kind of testing?

    A musical model is some formalization of a music theoretic idea. A big example would be the Schenkerian idea that everything reduces to (or is derivable from) I-V-I or V-I according to a collection of rules. A smaller model would be a set of rules governing voice-leading behavior for a particular style of music, or a collection of rules describing the shape of jazz leads. Often these models are represented in a way that can be used either analytically or generatively. Think of a collection of dice: if we have some assumptions about how the dice behave (the model), then we can analyze an existing series of dice rolls to see whether our dice might have produced it. Or, we can make new rolls to see if the original assumptions about the dice themselves were reasonable.

    Now, suppose someone claims to have an accurate model for Western harmony and shows lots of examples of how well it analyzes existing music. This sort of analysis-only presentation in a publication is quite common. But, regardless of how well the analysis went, if that same model is more inclined to generate weird sequences like a never-ending II-V-II-V-II-V-... than a reasonable case like II-V-I, then clearly the model is actually missing something really important (and the example given is not a straw man; I actually worked with something much like this recently). Determining what a model is missing is sometimes easier by examining what it produces rather than looking at how it analyzes existing work. However, establishing this for more complicated models with more ambiguous claims is more difficult, and that's where any sort of field consensus falls apart.
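
    To make the dice analogy and the II-V example above concrete, here is a toy sketch in Python (the transition probabilities are invented purely for illustration; this is not anything from the actual study) of a tiny chord model being used both analytically and generatively:

    import random

    # Toy first-order transition table over Roman-numeral chords (invented numbers).
    transitions = {
        "I":  {"IV": 0.4, "V": 0.4, "II": 0.2},
        "II": {"V": 0.9, "IV": 0.1},
        "IV": {"V": 0.6, "I": 0.4},
        "V":  {"I": 0.5, "II": 0.5},  # V -> II is allowed, which makes II-V-II-V-... loops possible
    }

    def score(progression):
        # Analytic use: how much probability the model assigns to an existing progression.
        p = 1.0
        for a, b in zip(progression, progression[1:]):
            p *= transitions.get(a, {}).get(b, 0.0)
        return p

    def generate(start="I", length=8):
        # Generative use: roll the same "dice" to produce a new progression.
        out = [start]
        while len(out) < length:
            chords, weights = zip(*transitions[out[-1]].items())
            out.append(random.choices(chords, weights=weights)[0])
        return out

    print(score(["II", "V", "I"]))  # the model rates a sensible cadence highly...
    print(generate())               # ...yet sampling can still wander into II-V-II-V-...

    A model that scores existing cadences well but keeps generating that kind of loop is exactly the situation where generation reveals what an analysis-only evaluation missed.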

    2 " algorithm performance and music perception." Is this your "can people tell the difference between computer generated and machine generated music?"

    It is common in psychology experiments for participants to be asked to do a straightforward task that is analyzed to look for correlations between many different things. In this case, the task is “distinguish the human-made score from the computer-made score.” How well people did at that task is one thing that will be looked at since it is the simplest score that can be derived. However, there are other things that will be looked at too that fall into the broader category of “algorithm performance and music perception.”
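
    As a rough illustration only (the numbers below are made up; this is not data or the analysis plan from the study), the simplest score and one possible correlation could be computed like this in Python:

    import numpy as np

    # Hypothetical responses: 1 = correctly identified the clip's source, 0 = missed it.
    participants = {
        "p1": {"training_years": 0,  "answers": [1, 0, 1, 0, 1, 0]},
        "p2": {"training_years": 5,  "answers": [1, 1, 0, 1, 1, 0]},
        "p3": {"training_years": 12, "answers": [1, 1, 1, 0, 1, 1]},
    }

    training = [p["training_years"] for p in participants.values()]
    accuracy = [float(np.mean(p["answers"])) for p in participants.values()]  # the simplest score

    # One of many possible relationships between musical background and task performance.
    r = np.corrcoef(training, accuracy)[0, 1]
    print("accuracy per participant:", accuracy)
    print("correlation with training:", round(r, 2))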

    3 "working with multidimensional data, new categories of grammars, machine learning algorithms, and so on." I think I get the machine learning algorithms, but not the rest.

    Multidimensional data just means things that require many dimensions to be represented. A musical example would be the collection of pitches played simultaneously by an orchestra at a given point in time; if one performer is one dimension, then the dimensionality can be quite high. A geologic example would be a list of readouts from a giant array of seismometers (which is something I have also worked with). You can have time series of both of these, and such data is often a real pain to deal with efficiently on a computer. For example, whether an algorithm finishes in minutes or months can come down to little things like the exact way that data is represented in memory.
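
    For a small, concrete example of that last point (just an illustration in Python, not taken from the study), here is the same "time steps x sensors" data represented two different ways, with very different runtimes for the same column sums:

    import time
    import numpy as np

    n_steps, n_dims = 20000, 200  # e.g., time steps x seismometers (or performers)
    as_lists = [[float((i + j) % 7) for j in range(n_dims)] for i in range(n_steps)]
    as_array = np.array(as_lists)  # the same values in one contiguous block of memory

    t0 = time.perf_counter()
    col_sums_slow = [sum(row[d] for row in as_lists) for d in range(n_dims)]
    t1 = time.perf_counter()
    col_sums_fast = as_array.sum(axis=0)  # vectorized over the contiguous layout
    t2 = time.perf_counter()

    print("list of lists: %.3f s   numpy array: %.3f s" % (t1 - t0, t2 - t1))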

    Grammars are used to model spoken languages as well as formal languages (which include things like programming languages). A grammar is a set of rules that indicate the structure of sentences within a language. For spoken language, words have different abstract types: nouns, verbs, etc., and those have to be in a certain order to be syntactically correct or “grammatical.” Grammars have been applied to things other than spoken language, including fractals and music. In music, instead of using labels like “noun,” the abstract labels can be things like “subdominant chord,” “tonic region,” “phrase,” etc.
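
    As a toy example (the rules below are invented purely for illustration and are not a real music theory), a tiny generative grammar over harmonic-function labels in Python might look like:

    import random

    # Toy rewrite rules: nonterminals expand until only chord labels (terminals) remain.
    rules = {
        "PHRASE": [["TONIC", "SUBDOM", "DOM", "TONIC"]],
        "TONIC":  [["I"], ["I", "vi"]],
        "SUBDOM": [["IV"], ["ii"]],
        "DOM":    [["V"], ["V7"]],
    }

    def expand(symbol):
        if symbol not in rules:
            return [symbol]  # terminal: an actual chord label
        production = random.choice(rules[symbol])
        return [chord for s in production for chord in expand(s)]

    print(expand("PHRASE"))  # e.g. ['I', 'vi', 'ii', 'V7', 'I']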

    4 "The first reason is that it holds a sort of inexplicable fascination for a decent portion of the population". Define "decent".

    This seems like the sort of question that is better answered by a literature search than asking me to give an exact descriptor. There are enough people interested in getting machines to make music that there are regular and long-standing conferences where related topics are discussed and argued over. If you want an exact quantity of people who respond positively vs. people who don't, I haven't been keeping tally. I am speaking from personal experience and observation of other people in the described situations – both involving the general public and in more academic settings.

    5 "but if you run the same model generatively and the output is garbage...that means it didn't capture much." Who, in this case, determines if the output is "garbage"?

    Because of the nature of some of the hypotheses being tested in this case, the participants are the judge. A single person, regardless of background, cannot be the judge for reasons I can't go into. The participants' responses as a whole will either disprove or support several hypotheses about various models' output.

    6 "Because of the unfortunate qualitative relationship between a score and its many possible performances, the performances must be normalized across cases if there is any hope at testing more basic features of a score." Yet it is that very "unfortunate relationship" that makes it music and not garbage.

    The notion that compositional features can be assessed at least semi-independently of the performance is rather widely assumed in music theory. If that assumption is wrong, the results will show as much. 
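
    To give a sense of what that normalization can look like in practice, here is a rough sketch of my own using the mido Python library (this is only an illustration, not the study's actual rendering pipeline), where every score gets the same fixed tempo and velocity:

    import mido

    def render(notes, path, tempo_bpm=100, velocity=64):
        # Same deadpan performance for every stimulus: fixed tempo, fixed velocity, no rubato.
        mid = mido.MidiFile(ticks_per_beat=480)
        track = mido.MidiTrack()
        mid.tracks.append(track)
        track.append(mido.MetaMessage('set_tempo', tempo=mido.bpm2tempo(tempo_bpm)))
        for pitch, beats in notes:
            track.append(mido.Message('note_on', note=pitch, velocity=velocity, time=0))
            track.append(mido.Message('note_off', note=pitch, velocity=0,
                                      time=int(beats * mid.ticks_per_beat)))
        mid.save(path)

    # (MIDI pitch, duration in beats) pairs for two different scores, identical rendering settings.
    render([(60, 1), (64, 1), (67, 2)], 'phrase_a.mid')
    render([(62, 2), (65, 1), (69, 1)], 'phrase_b.mid')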

  • Mr. Z., of course you are right. BUT, short of writing a book to communicate every detail of every detail, I think it reasonable to assume that Donya knows apples ain't oranges and, starting with a general hypothesis, would segment the study with some kind of rational relativity. Remember, he said 'developing'. The clay is still wet and pliable. Who knows, maybe he'll mold it into something no one has seen before. i.e. there is no such thing as free energy.... Want to bet???? RS

    Fredrick zinos said:

    "Are you attempting to find math sequences and 'formulas' underlying generally favorable

    tone patterns that would be indistinguishable from human inventions"

    That’s an excellent question, Roger.

    Possibly that question needs to be modified to say: "Generally favorable tone patterns for western ears in the 21st century."

  • Much has already been said in this discussion, but I could not resist adding a few additional viewpoints. In fact, I think that this investigation has fundamental shortcomings. I listened to the short clips in your survey, and in principle, I do not see how it is possible to decide whether these were generated by a computer or put together by a human, for the following reasons:

    1) Who wrote the “human” short clips? Was it only one composer? We know nothing about this composer; it could have been someone who produces crap. In that case, the computer may have done a better (or one might say “a more human-like”) job. Or did you take some fragments from phrases written by well-known composers? In any case, the four examples you presented do not offer much guidance. The basic problem is that the clips are far too short and simple (see my second point) and might therefore have been put together either by a human or by a computer.

    2) I do not think that the short clips you presented can be called “music”. At most, these clips could be called fragments of musical phrases, and as such, they are rather empty. And I think that it is a misconception to draw any conclusions about real musical pieces from such fragments. It is like taking one single phrase from a novel. Most people could write a nice phrase that would fit in well, but from this you cannot extrapolate that everyone can write a good book. The same is true for music. It would be effortless for most composers to put together the kind of clips you present (even fragments that may have been taken from a work written by a famous composer), but this is totally different from writing a complete (good) musical composition!

  • Fredrick: you will be able to evaluate your assumptions and find answers to your analysis-related questions when the results are published. Obviously, you are not obligated to participate in this pilot study if it is not to your liking.

    Johan:

    1) Who wrote the “human” short clips?

    As explained in some previous posts, I can't provide details like this before the full study is complete.

    2) I do not think that the short clips you presented can be called “music”. At most, these clips could be called fragments from musical phrases

    The study does refer to the stimuli as phrases: “Phrase 1 of 40” and so on.

    It is like taking one single phrase from a novel.

    Testing sentence generation is actually something I mentioned previously in one of my other replies. It is one way that generative models for spoken language are tested.

    It would be effortless for most composers to put together the kind of clips you present

    There are phrases from both humans and machines. Remember that what is effortless for a human and effortless for a computer is often vastly different. There are many tasks that more or less any human can do well that are still a serious challenge for computers.



  • Donya Quick said:

    Fredrick: you will be able to evaluate your assumptions and find answers to your analysis-related questions when the results are published. Obviously, you are not obligated to participate in this pilot study if it is not to your liking.

    Johan:

    1) Who wrote the “human” short clips?

    As explained in some previous posts, I can't provide details like this before the full study is complete.

    Well, this confirms what I said before. It is impossible to know whether the clips were designed by a computer or produced by a human, which makes this study stand on shaky ground.

    2) I do not think that the short clips you presented can be called “music”. At most, these clips could be called fragments from musical phrases

    The study does refer to the stimuli as phrases: “Phrase 1 of 40” and so on.

    Your thread is called “Online Survey Testing MUSIC Perception”.

    It is like taking one single phrase from a novel.

    Testing sentence generation is actually something I mentioned previously in one of my other replies. It is one way that generative models for spoken language are tested.

    And what kind of conclusion about music would you like to draw from such a test?

    It would be effortless for most composers to put together the kind of clips you present

    There are phrases from both humans and machines. Remember that what is effortless for a human and effortless for a computer is often vastly different. There are many tasks that more or less any human can do well that are still a serious challenge for computers.

    That is certainly true, but it is beside the point. Your question was whether one can distinguish between the clips made by a computer and those made by a human. I think that this is fundamentally impossible, since we do not know anything about the human composer(s), who could have written total garbage. The survey would have been more straightforward if your question had been: which of the two clips do you like best?



  • Raymond Kemp said:

    Johan,

    I agree with the points you make, but... I treated it as a harmless five minutes to go through them.

    What does a university degree mean these days? Jaw, jaw.

    We better not start on man-made global warming :-)

    Ray, I don't think that this thing is particularly harmful. I just wanted to point out that, IMO, there are fundamental flaws in the setup of this survey.

  • Frustration with the limited detail that can be shared about the experiment is common for this format of study, particularly in a forum setting where participants can discuss it. However, as is common with many psychology-related studies, there are details that, if discovered by any future participants before taking the study, could and probably would bias the results. For example, I'm aware of some in-person survey studies where failure to withhold explanations until the forms were collected caused some participants to scratch out their original answers at the last minute in an effort to give what they then thought were the “right” ones before handing in their forms. People will do things like that even when they don't receive any sort of benefit for responding in a particular way. Similar risks would exist for this study if I were to answer some of the questions asked here in detail. To get full detail on a study like this before publication, you pretty much have to either be involved in the study's design as a colleague or be in some sort of official reviewing position, like a university's institutional review board. Otherwise, for psychology-related experiments, it is the norm for those not directly involved in the experimental design or internal review to have to wait until the results are published to see what was going on behind the scenes in full.

    So far the study looks stable in a technological sense, since nobody has reported hangs, failed audio, etc. The pilot version will continue to run for a while, but since I may not be able to keep up with the discussion here, I would like to thank everyone who has taken the time to look at the study in any amount of detail so far, particularly those who went all the way through to the end of it. The anonymous comments have also been very helpful.

  • Unless I have misinterpreted the questions, topics like "direction and intent" still read to me as an inquiry at some level into the hypotheses being tested. I realize it might be hard to see why that is an issue without already knowing the answer, but I've been as specific as I can about subjects like that without risking repeating the problems of those other surveys I mentioned.

  • Stick to your guns Donya, Scrooge is alive and well and loves his dark box..... i.e. "It ain’t gonna fly Orville."

    You are right, I believe, in keeping it a 'blind' study. It sounds to me like you have a handle on your vision, so don't be diverted by the 'experts'. RS

    Donya Quick said:

    Unless I have misinterpreted the questions, topics like "direction and intent" still read to me as an inquiry at some level into the hypotheses being tested. I realize it might be hard to see why that is an issue without already knowing the answer, but I've been as specific as I can about subjects like that without risking repeating the problems of those other surveys I mentioned.
