Paper ID: 407
Paper Title: Multitrack Music Transformer
Track Name: Main Track: Audio and Acoustic Signal Processing

Questions
1. Importance/Relevance: 3. Of sufficient interest
2. Novelty/Originality: 3. Moderately original
3. Technical Correctness: 3. Probably correct
4. Experimental Validation: 4. Sufficient validation / theoretical paper
5. Clarity of Presentation: 4. Very clear
6. Reference to Prior Work: 3. References adequate
7. Overall evaluation of this paper: 4. Definite accept
8. Detailed assessment of the paper (seen by the authors):
This paper contributes 1) generation of multi-track symbolic music with improved inference speed, and 2) a quantitative measure of musical self-attention. Overall, this paper is very well written. The ideas presented in this paper are all clear and convincing. The insights given by this paper are reusable in future research on music generation.
Other comments:
1. Only 9 people participated in the subjective test. Is the result reliable?
2. I guess that, ideally, the average generated sample length should be close to the average clip length in the training dataset. Could this serve as an evaluation measure?

Reviewer #3

Questions
1. Importance/Relevance: 2. Of limited interest
2. Justification of Importance/Relevance Score (required if score is 1 or 2):
The topic, symbolic music generation using language models, feels a bit out of scope for a signal processing conference.
3. Novelty/Originality: 3. Moderately original
4. Technical Correctness: 3. Probably correct
5. Experimental Validation: 4. Sufficient validation / theoretical paper
6. Clarity of Presentation: 4. Very clear
7. Reference to Prior Work: 3. References adequate
8. Overall evaluation of this paper: 3. Marginal accept
9. Detailed assessment of the paper (seen by the authors):
This paper presents an approach for generating multitrack music from MIDI using a transformer model.
While the topic of symbolic music generation with language models is a bit of a mismatch for ICASSP, the paper is well-written, with nice figures and examples. The proposed technique is computationally cheaper than previous work, but these savings come mainly from removing the autoregressive relationship between the different properties (e.g., pitch, duration) of a note. The savings come at a cost, as the proposed method is outperformed by previous methods on both objective and subjective metrics. The authors also present an approach for analyzing the learned attention matrices, and find that the model does learn to attend to regions that correspond to musically relevant pitch intervals and note durations.
- Section 3.1: "our proposed representation can represent 2.6 and 3.5 times longer music samples..." Since this is considered a major advantage of the proposed method, it would be better to break down which aspects of the representation enable these savings. For example, doesn't a majority of this gain come from reducing the number of MIDI instruments by half?
- Section 3.2: What is meant by the top-K sampling strategy?

Questions
1. Importance/Relevance: 3. Of sufficient interest
2. Novelty/Originality: 3. Moderately original
3. Technical Correctness: 3. Probably correct
4. Experimental Validation: 3. Limited but convincing
5. Clarity of Presentation: 3. Clear enough
6. Reference to Prior Work: 3. References adequate
7. Overall evaluation of this paper: 2. Marginal reject
8. Justification of Overall evaluation of this paper (required if score is 1 or 2):
The more interesting part of this paper is the study of musical self-attention. Unfortunately, this gets very little space at the end of the final section.
9. Detailed assessment of the paper (seen by the authors):
This paper is about a transformer-based approach to symbolic music generation.
The authors propose a slightly different kind of encoding of the note sequences, as well as a decoder-only transformer model with multi-dimensional input and output. Both objective scores and MOS ratings show that the method performs slightly worse than the state of the art. Still, it is more efficient and can generate longer sequences.
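One reviewer asks what is meant by the top-K sampling strategy. For reference, a minimal sketch of generic top-k sampling from a categorical distribution over next-token logits; the function name, signature, and temperature-free softmax here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Sample a token index, restricted to the k highest-scoring logits.

    Illustrative sketch of generic top-k sampling; not the authors' code.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # Find the indices of the k largest logits and mask out the rest.
    top = np.argpartition(logits, -k)[-k:]
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]
    # Softmax over the surviving logits (masked entries get probability 0).
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

With k = 1 this reduces to greedy decoding; larger k trades determinism for diversity by renormalizing the probability mass over the k most likely tokens before sampling.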