Meta Review
This paper proposes a new strategy for in-context learning. The idea is to use the final-token hidden states of in-context examples to form one cluster for each label; a test example can then be assigned to the closest cluster centroid to perform classification. The results show that this outperforms methods such as batch calibration, which all rely on token probabilities rather than hidden states.
The reviewers generally agree that the method is simple and effective, the experiments are thorough, and the paper is well-written. One reviewer raised concerns about novelty, which I do not consider serious. The reviewers identified the method's need for additional data as a weakness, but the paper already contains some analysis of data efficiency, and the authors will further strengthen that section by including vanilla ICL with different numbers of examples.
Summary Of Reasons To Publish:
Summary Of Suggested Revisions:
Overall Assessment: 4 = There are minor points that may be revised
Suggested Venues: NAACL
Best Paper AE: No
Ethical Concerns:
There are no concerns with this submission
Needs Ethics Review: No
Author Identity Guess: 1 = I do not have even an educated guess about author identity.
Great Reviews: bczJ
Reported Issues: No
Paper Summary:
The paper identifies the problem of potential bias introduced by the output label tokens (e.g., positive/negative) in the decoding process of LLMs prompted to perform in-context learning (ICL) for classification. By introducing Hidden Calibration (HC), a clustering method over the last-layer hidden states of LLMs, the authors present a more effective decision boundary for classification. The authors then empirically test the method across various dataset-LLM combinations, demonstrating improvements over several calibration methods. Additionally, the authors conduct analyses showing that the hidden states of LLMs are linearly separable under certain conditions, and that clustering produces a more robust decision boundary.
Summary Of Strengths:
Summary Of Weaknesses:
Comments Suggestions And Typos:
Confidence: 5 = Positive that my evaluation is correct. I read the paper very carefully and am familiar with related work.
Soundness: 3.5
Overall Assessment: 2.5
Best Paper: No
Ethical Concerns:
There are no concerns with this submission
Needs Ethics Review: No
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Reviewer Certification: I certify that the review I entered accurately reflects my assessment of the work. If you used any type of automated tool to help you craft your review, I hereby certify that its use was restricted to improving grammar and style, and the substance of the review is either my own work or the work of an acknowledged secondary reviewer.
Paper Summary:
The paper proposes a simple yet effective calibration method for ICL on classification tasks, focusing on the last-layer hidden state of the query's final token. Specifically, it computes the centroid of the hidden states for each label offline using a calibration set, and at inference it assigns a test example to the label whose centroid is closest to the example's hidden state. The paper shows the method outperforms previous methods across 10 classification datasets. It includes a detailed analysis of the method's time and data efficiency and demonstrates robustness by varying important aspects of ICL, such as the number of demonstrations and the ordering of examples. The paper also analyzes the limitations of previous calibration methods, providing further evidence for the effectiveness of the proposed method.
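As a rough illustration of the procedure described above (a minimal sketch, not the authors' implementation; it assumes Euclidean distance and that the last-layer, last-token hidden states have already been extracted as numpy arrays):

    import numpy as np

    def fit_centroids(hidden_states, labels):
        # One centroid per label: mean of that label's last-token hidden states.
        # hidden_states: (n_examples, d); labels: (n_examples,) integer classes.
        return {c: hidden_states[labels == c].mean(axis=0)
                for c in np.unique(labels)}

    def predict(centroids, query_state):
        # Nearest-centroid rule: the label whose centroid minimizes Euclidean distance.
        return min(centroids, key=lambda c: np.linalg.norm(query_state - centroids[c]))

    # Toy usage with random vectors standing in for actual LLM hidden states.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))      # 20 calibration examples, hidden size 8
    y = rng.integers(0, 2, size=20)   # binary labels
    centroids = fit_centroids(X, y)
    print(predict(centroids, X[0]))   # nearest-centroid label for one query

The Euclidean metric here is an assumption for illustration; the paper may use a different distance or normalization.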
Summary Of Strengths:
Summary Of Weaknesses:
Comments Suggestions And Typos:
I am aware that comparisons of data efficiency are already included in the paper, but I would still suggest that the authors compare their method against the baseline methods, as in Figure 4, with the x-axis being the total number of data samples used. Beyond the current time-efficiency analysis, it is important to show directly how much the method improves when data efficiency is controlled for.
Confidence: 5 = Positive that my evaluation is correct. I read the paper very carefully and am familiar with related work.
Soundness: 3.5
Overall Assessment: 3.5
Best Paper: No
Ethical Concerns:
There are no concerns with this submission
Needs Ethics Review: No
Reproducibility: 5 = They could easily reproduce the results.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: Before the review process
Knowledge Of Paper Source: other (specify)
Knowledge Of Paper Source Other: Reviewed in previous cycle.
Impact Of Knowledge Of Paper: Not much
Reviewer Certification: I certify that the review I entered accurately reflects my assessment of the work. If you used any type of automated tool to help you craft your review, I hereby certify that its use was restricted to improving grammar and style, and the substance of the review is either my own work or the work of an acknowledged secondary reviewer.
Paper Summary: This paper proposes a novel few-shot learning method using LLMs. The authors suggest constructing the decision boundary directly in the model's hidden representation space before un-embedding, rather than in the probability space after the softmax. Specifically, during the learning phase, they (1) encode each calibration example together with some in-context examples and then (2) find the centroid of each class's representations. At inference time, they encode the test sample in the same way and predict the class whose centroid is closest to it.
They show that the proposed method outperforms previous calibration methods, and they also conduct analyses of the hidden representation space and the method's robustness.
Summary Of Strengths:
Summary Of Weaknesses:
I do not see major issues in this paper; the issues below are relatively minor:
High-level comments
Clarity Issue
While clarity has improved considerably compared to the previous version (especially in the main methodology part), some parts remain somewhat confusing, mostly in the analysis sections.
Comments Suggestions And Typos:
Why does the inter-class centroid distance decrease when there are more examples? This seems to contradict the observation in Section 5.1 that the classes are more separable with more in-context examples.
Confidence: 4 = Quite sure. I tried to check the important points carefully. It’s unlikely, though conceivable, that I missed something that should affect my ratings.
Soundness: 4 = Strong: This study provides sufficient support for all of its claims/arguments. Some extra experiments could be nice, but not essential.
Overall Assessment: 3.5
Best Paper: No
Ethical Concerns:
There are no concerns with this submission
Reproducibility: 4 = They could mostly reproduce the results, but there may be some variation because of sample variance or minor variations in their interpretation of the protocol or method.
Datasets: 1 = No usable datasets submitted.
Software: 1 = No usable software released.
Knowledge Of Or Educated Guess At Author Identity: No
Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Knowledge Of Paper Source: N/A, I do not know anything about the paper from outside sources
Impact Of Knowledge Of Paper: N/A, I do not know anything about the paper from outside sources
Reviewer Certification: I certify that the review I entered accurately reflects my assessment of the work. If you used any type of automated tool to help you craft your review, I hereby certify that its use was restricted to improving grammar and style, and the substance of the review is either my own work or the work of an acknowledged secondary reviewer.