===== METAREVIEW =====
Paper ID: 291
Paper Title: LakhNES: Improving multi-instrumental music generation with cross-domain pre-training
Track Name: Papers

META-REVIEWER #2

META-REVIEW QUESTIONS

2. The title and abstract reflect the content of the paper. Strongly agree
3. The paper discusses, cites and compares with all relevant related work. Strongly agree
4. The writing and language are clear and structured in a logical manner. Strongly agree
5. The references are well formatted. Yes
6. The topic of the paper is relevant to the ISMIR community. Strongly agree
7. The content is scientifically correct. Strongly agree
8. The paper provides novel methods, findings or results. Strongly agree
9. The paper will have a large influence/impact on the future of the ISMIR community. Agree
10. The paper provides all the necessary details or material to reproduce the results described in the paper. Agree
11. The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree

12. Please explain your assessment of reusable insights in the paper.

While the focus of the paper is on the surface somewhat narrow (generating chiptunes), the approach of transferring a generative model from a large, diverse collection to a smaller, more constrained setting is well-executed and should be instructive across different areas of MIR.

17. Main review and comments for the authors.

=== Meta-review ===

All reviewers found the paper to be exceptionally well-written and well-executed. While some reviewers did note a few weaknesses in the evaluation and baselines, the overall consensus was to accept.

=== Initial review ===

This paper describes a transfer-based approach to generative modeling of polyphonic, symbolic music. The proposed method uses a Transformer-based model, fit to the Lakh MIDI dataset and fine-tuned on NES-MDB, to leverage both a large quantity of diverse music and a smaller quantity of more constrained music. The proposed method is evaluated against weak and strong baselines, using both quantitative and qualitative methods, and performs favorably while showing that there is still room for improvement.

Overall, I enjoyed this paper. The writing is clear, and the authors do a good job of presenting a relatively complex method in a limited amount of space. The authors are also honest about the limitations of the proposed method. The linked audio examples sound compelling as well.

===== REVIEWS =====
Paper ID: 291
Paper Title: LakhNES: Improving multi-instrumental music generation with cross-domain pre-training
Track Name: Papers

Reviewer #1

Questions

2. The title and abstract reflect the content of the paper. Agree
3. The paper discusses, cites and compares with all relevant related work. Disagree
4. The writing and language are clear and structured in a logical manner. Strongly agree
5. The references are well formatted. Yes
6. The topic of the paper is relevant to the ISMIR community. Strongly agree
7. The content is scientifically correct. Agree
8. The paper provides novel methods, findings or results. Disagree
9. The paper will have a large influence/impact on the future of the ISMIR community. Agree
10. The paper provides all the necessary details or material to reproduce the results described in the paper. Agree
11. The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding).
Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree

12. Please explain your assessment of reusable insights in the paper.

One thing that could be considered an insight is the improvement in performance of a generative model achieved by pre-training it on a very large dataset, itself generated from an existing large dataset by applying heuristics under minimal assumptions. However, this is a very minor one in my opinion, as it only confirms what is mostly common knowledge these days: that pre-training a model on data relevant to a task (but not the task itself) can lead to significant improvements if the two datasets are related enough.

17. Main review and comments for the authors.

Overall
* The abstract is, more or less, clear and covers the important points of the paper. It would be better if the Lakh MIDI Dataset were specifically mentioned in it.
* The introduction is mostly fine, except for one statement that didn't make much sense to me.
* There are a few non-relevant references that could be replaced with more relevant ones.
* A few additional minor details would be good in the datasets section. Otherwise, it looks OK.
* The experimental evaluation is quite broad and well thought out, which is a good thing. I would, however, have liked to see certain additional details, such as those I mention in the detailed review below.
* The paper overall seems to confirm what I would consider common knowledge in the context of a specific application (music generation), i.e. that pre-training with a massive dataset helps improve performance on a task of interest. The overall novelty is minimal in that mostly standard data augmentation techniques have been employed to create a massive pre-training dataset that helps improve the model's performance. What I do find interesting, though, are the fairly minimal assumptions made in the heuristic-based augmentation of the Lakh MIDI data.

Abstract
* My one issue with the abstract is that the Lakh MIDI Dataset is not explicitly mentioned, although it is an important aspect of the work. Please mention it, as its absence might mislead the reader into thinking that you are proposing a new dataset yourself. For instance, something along the lines of "... a large collection of heterogeneous music, namely the Lakh MIDI Dataset." would be helpful.

Introduction
* It is not very clear to me what is meant by "... we incorporate the semantics of the instruments directly into our language-like representation." Perhaps this could be rephrased in simpler language at this early stage of the paper?

Related Work
* References [21-23] (which pertain to audio music generation) are not really relevant to this paper (which pertains to symbolic music generation). I would skip them. The authors could consider citing other fairly recent work in symbolic music generation, such as the MusicVAE paper (arXiv 2018), the StructureNet paper (ISMIR 2018), the MuseGAN paper (AAAI 2018) and the MIDINet paper (ISMIR 2017). In my opinion those are more relevant here, and I would strongly recommend including them.

Datasets and Task
* Is it the case that, where there are time-synchronous events, they are ordered by part (P1, P2, TR, NO, or some other order)? Unless I missed it, this information isn't in the text.
* Why do the authors stop at a maximum of 16 outputs per input example?

Experiments
* I hope the authors also plan to release details (code, models) pertaining to the baselines as well. This isn't mentioned anywhere.
* Could the authors mention the type of back-off algorithm used for the 5-gram models?
* There is also very little information on the baseline LSTM model. Please add some more details of the architecture and how it was arrived at.
* Just for the sake of completeness, would it be possible to state how much improvement there is in the LSTM's performance on the task with Lakh MIDI pre-training?
* Please explain on what basis the track durations were chosen in each of the two user studies. I would have expected consistent track durations across the two. Why is a 10-second clip more suitable for a preference test, while a 5-second one is used for a Turing test?
* I think that the musical experience of the users in the user study is important. Could the authors comment on why they didn't consider it in the paper?
* Was any model selection carried out during evaluation of the models? Please include at least some details of the procedure employed.

Reviewer #2

Questions

2. The title and abstract reflect the content of the paper. Strongly agree
3. The paper discusses, cites and compares with all relevant related work. Agree
4. The writing and language are clear and structured in a logical manner. Strongly agree
5. The references are well formatted. Yes
6. The topic of the paper is relevant to the ISMIR community. Strongly agree
7. The content is scientifically correct. Agree
8. The paper provides novel methods, findings or results. Agree
9. The paper will have a large influence/impact on the future of the ISMIR community. Agree
10. The paper provides all the necessary details or material to reproduce the results described in the paper. Agree
11. The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree

12. Please explain your assessment of reusable insights in the paper.

This paper presents a heuristic-based methodology to map instruments from a larger heterogeneous dataset to a smaller ensemble. This can be used to pre-train music generation models on the larger dataset before fine-tuning them on a smaller, more homogeneous dataset. The authors show that such pre-training improves the performance of the model. This mapping mechanism can be applied to other multi-instrument / multi-voice generation tasks, encouraging researchers to explore transfer learning for their research.

17. Main review and comments for the authors.

This paper presents a method for multi-instrumental music generation in the symbolic domain. The authors apply a previously proposed Transformer-based architecture [Dai et al., Transformer-XL, 2019] to the NES multi-instrument symbolic music dataset [Donahue et al., ISMIR, 2018], which is composed of 4-instrument chiptunes. To accomplish this task, they extend the data representation used by the PerformanceRNN framework [Simon and Oore, 2017] to the multi-instrument case. The results show that the proposed architecture outperforms the baselines in both quantitative and qualitative evaluations. The presented audio examples are convincing. However, the baselines considered (n-gram models and an LSTM) have already been shown to be inferior to Transformer-based architectures in previous research [Huang et al., ICLR, 2018]. Hence, improving upon the baselines by using the Transformer architecture cannot be considered a major technical contribution.
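As an aside on the event representation mentioned above: to make the multi-instrument extension of the PerformanceRNN-style encoding concrete, the sketch below serializes note events into a flat token stream, prefixing each on/off event with its NES channel (P1, P2, TR, NO) and breaking ties between simultaneous events with a fixed channel order. The token names, the time unit and the ordering rule are illustrative assumptions, not the paper's exact vocabulary.

```python
# Illustrative sketch only: token names, the time unit and the fixed channel
# order are assumptions, not the paper's exact event vocabulary.
CHANNEL_ORDER = ["P1", "P2", "TR", "NO"]   # pulse 1, pulse 2, triangle, noise
MAX_SHIFT = 100                            # longest single time-shift token

def events_to_tokens(events):
    """Serialize (time, channel, kind, pitch) note events into a flat token
    sequence, emitting time-shift tokens between distinct event times."""
    # Sort by time, then by a fixed channel order, so that simultaneous events
    # always serialize in the same order (a one-to-one mapping).
    order = lambda e: (e[0], CHANNEL_ORDER.index(e[1]), e[2], e[3])
    tokens, prev_time = [], 0
    for time, channel, kind, pitch in sorted(events, key=order):
        delta = time - prev_time
        while delta > 0:                   # split long gaps into bounded shifts
            step = min(delta, MAX_SHIFT)
            tokens.append(f"WT_{step}")    # "wait"/time-shift token
            delta -= step
        prev_time = time
        tokens.append(f"{channel}_{kind}_{pitch}")   # e.g. "P1_NOTEON_60"
    return tokens

# Example: P1 and P2 start a dyad together, then P1 releases 88 time units later.
print(events_to_tokens([(0, "P1", "NOTEON", 60), (0, "P2", "NOTEON", 64),
                        (88, "P1", "NOTEOFF", 60)]))
# -> ['P1_NOTEON_60', 'P2_NOTEON_64', 'WT_88', 'P1_NOTEOFF_60']
```

A deterministic tie-breaking rule like the one sketched here is also what keeps the mapping from score to token sequence one-to-one, a point Reviewer #3 raises below.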
The main contribution of the paper, however, is the heuristic-based methodology that the authors propose for mapping instruments from a larger heterogeneous dataset to a smaller ensemble. They use this to first pre-train their model on the larger Lakh MIDI dataset, followed by fine-tuning on the NES dataset. This results in both quantitative and qualitative improvements in model performance. The potential use of this methodology in future music generation research makes a strong case for the paper.

The paper is extremely well-written, well-structured and presents its ideas very nicely. With the availability of the code, certain details with regard to the architecture and model configurations would also be available to readers.

A few comments for the authors:
- Considering that the Transformer architecture (and the concept of multi-head attention) is never really discussed in the paper, mentioning the number of attention layers and heads in Section 5 might be a bit confusing for some readers.
- As a side note, in the related work it might be worthwhile to also mention cases where transfer learning has been useful for other MIR tasks [Choi et al., 2017].

Reviewer #3

Questions

2. The title and abstract reflect the content of the paper. Strongly agree
3. The paper discusses, cites and compares with all relevant related work. Agree
4. The writing and language are clear and structured in a logical manner. Strongly agree
5. The references are well formatted. Yes
6. The topic of the paper is relevant to the ISMIR community. Strongly agree
7. The content is scientifically correct. Strongly agree
8. The paper provides novel methods, findings or results. Agree
9. The paper will have a large influence/impact on the future of the ISMIR community. Agree
10. The paper provides all the necessary details or material to reproduce the results described in the paper. Agree
11. The paper provides reusable insights (i.e. the capacity to gain an accurate and deep understanding). Such insights may go beyond the scope of the paper, domain or application, in order to build up consistent knowledge across the MIR community. Agree

12. Please explain your assessment of reusable insights in the paper.

The event-based representation can be used by future music generation models. The availability of pre-trained models and code means that downstream tasks are possible, as well as straightforward extensions to the model.

17. Main review and comments for the authors.

SUMMARY

The paper applies the Transformer-XL model to multi-instrument symbolic music generation on the NES database.

REVIEW

The paper is very clear and readable thanks to its excellent writing. It also provides high-quality figures as well as sound examples, making for a very good presentation overall.

Since the Transformer models used are not novel and were applied to symbolic music generation before, the main contributions of this paper are:
1) Application to multi-instrument generation
2) Proposing a pre-training method to train from MIDI data with different instrument compositions

I think that both of the above are noteworthy contributions, although I take slight issue with the latter: while the pre-training scheme 2) brings a significant performance improvement and might be transferable to other generation work, it is not very elegant, as it requires very application-specific adaptation of the instrument annotations, which in turn makes it more cumbersome to reuse this principle in other generation tasks.
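To make concrete what this kind of application-specific adaptation might involve, here is a rough sketch that assigns MIDI tracks to the four NES channels by simple rules (percussion to the noise channel, the lowest-register melodic track to the triangle, remaining melodic tracks to the two pulse channels). The field names, rules and random tie-breaking are my own illustrative assumptions and are not claimed to match the authors' exact heuristic.

```python
import random

# Rough, illustrative sketch only: the rules and random choices below are
# assumptions, not the authors' exact instrument-mapping heuristic.
def map_tracks_to_nes(tracks, rng=random):
    """Assign MIDI tracks (dicts with 'is_drum' and 'avg_pitch') to the four
    NES channels: P1/P2 (pulse), TR (triangle), NO (noise)."""
    assignment = {}
    drums = [t for t in tracks if t["is_drum"]]
    melodic = [t for t in tracks if not t["is_drum"]]
    if drums:
        assignment["NO"] = rng.choice(drums)        # percussion -> noise
    if melodic:
        lowest = min(melodic, key=lambda t: t["avg_pitch"])
        assignment["TR"] = lowest                   # bass-like track -> triangle
        remaining = [t for t in melodic if t is not lowest]
        rng.shuffle(remaining)                      # random choice of leads
        for channel, track in zip(["P1", "P2"], remaining):
            assignment[channel] = track             # other melodic -> pulses
    return assignment

# Example song with three melodic tracks and one drum track.
song = [{"name": "bass", "is_drum": False, "avg_pitch": 40},
        {"name": "lead", "is_drum": False, "avg_pitch": 72},
        {"name": "pad",  "is_drum": False, "avg_pitch": 60},
        {"name": "kit",  "is_drum": True,  "avg_pitch": 0}]
print({ch: t["name"] for ch, t in map_tracks_to_nes(song).items()})
```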
Choosing this approach also seems unnecessarily complicated when one could instead pre-train on the MIDI data normally with a fixed selection of instruments, using an output layer that can emit events for all of them, before fine-tuning by replacing the output layer with a new one that has the required number of event probability outputs. This would allow most of the model to be pre-trained, would remove the need for careful manual label adaptation, and is also fairly standard practice in computer vision applications, so I am interested in why a simpler approach was not chosen instead (a concrete sketch of this alternative is included at the end of this review).

While choosing an event-based representation of music to get a more compact output space and fewer long-term dependencies compared to a quantised piano roll appears sound, it is not explained how simultaneous instrument on/off events are ordered in the sequence. This is important, as establishing a one-to-one mapping between the original and the event representation avoids the model being unnecessarily uncertain due to an undefined event order.

The paper is quite light on model and training details, but promises to release code and pre-trained models, which would be a sufficient alternative to describing the exact approach.

The evaluation is overall rigorous and extensive, using both perplexity values as well as multiple perceptual listening tests. For the Turing test, it might have been worthwhile to have a fixed number of examples for each participant, to control for potential training effects where participants get better at distinguishing the fake from the real examples over the course of their session. This is because longer sessions per participant might increase accuracy overall. Alternatively, this issue should be noted as a potential confounding factor in the paper.

Although there is no direct quantitative comparison to prior work, the paper makes very good use of multiple baseline models of varying complexity along with an ablation analysis (successively adding data augmentation and pre-training to the Transformer and checking performance), which greatly alleviates this shortcoming.

Overall, the paper does not offer much in the way of technical contributions, as it reuses existing models and the pre-training scheme is quite application-specific. However, the paper has great presentation, a solid empirical contribution showing the potential of Transformer models for multi-instrument music generation at a large scale as well as the empirical benefits of cross-domain pre-training, and quite rigorous evaluation, so I recommend accepting it.

MINOR POINTS

- Explanation of the event codes at the bottom of Figure 1? Maybe integrate the codes into Table 1.
- Explain what time resolution the medium and long time-shift events have.
- "TXL" is not defined.
- What is the number of layers in the LSTM? The number of hidden units in each layer?
- Section 6: "LakhNES (Transformer-XL pre-trained on Lakh MIDI and fine-tuned on NES-MDB)" - "with augmentation" needs to be added, or just assume it as already known since it was defined earlier; this description is repeated often throughout the paper.
- "We first amass a collection of 5-second audio clips from all of our methods" - please specify more clearly how this is done, since the audio clips are of different lengths. Do you simply take the first 5 seconds of each audio clip, or a random section?
- Section 7.1: make clear which exact four methods are being compared in the Turing test study (e.g. whether they include data augmentation).
- Section 7.1: please state which statistical test was used to arrive at the given p values.
- Specify what the error bars in Figures 4 and 5 indicate.
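To clarify the simpler alternative suggested earlier in this review (pre-train with an output layer covering the full instrument set, then swap in a smaller output layer before fine-tuning), here is a minimal PyTorch-style sketch of that standard transfer-learning pattern. The recurrent body stands in for whatever sequence model is used, and the vocabulary sizes and dimensions are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the suggested alternative: pre-train with an output layer
# covering a large event vocabulary, then replace only the output layer before
# fine-tuning on the smaller NES event vocabulary. Vocabulary sizes, dimensions
# and the GRU body (standing in for the actual sequence model) are placeholders,
# not values from the paper. The input embedding is assumed to be shared.
class EventLM(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.body = nn.GRU(d_model, d_model, num_layers=2, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)  # output layer, swapped later

    def forward(self, tokens):              # tokens: (batch, time) int64
        hidden, _ = self.body(self.embed(tokens))
        return self.head(hidden)             # (batch, time, vocab_size) logits

# 1) Pre-train on the large, heterogeneous event vocabulary (e.g. Lakh-derived).
model = EventLM(vocab_size=2000)
# ... pre-training loop over Lakh-derived token sequences would go here ...

# 2) Fine-tune: keep the pre-trained embedding and body, re-initialize only the
#    head so that it emits probabilities over the smaller NES event vocabulary.
nes_vocab_size = 500  # placeholder size
model.head = nn.Linear(model.head.in_features, nes_vocab_size)

# Sanity check of the fine-tuning configuration.
logits = model(torch.randint(0, nes_vocab_size, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 500])
```

Only the final projection is re-initialized; the embedding and body keep their pre-trained weights, which is what makes this pattern cheap to reuse across datasets.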