
AI Research Highlights | Week 2, 2024

Updated: Feb 20


1. LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

In this paper, the researchers argue that LLMs with RoPE have an inherent ability to handle long texts, even if they never encountered very long sequences during training. The limitation stems from out-of-distribution (O.O.D.) positions: the “larger” relative positions were never seen during training. To address this positional O.O.D. issue, they propose Self-Extend, which extends the context window of LLMs without any fine-tuning. Self-Extend maps the unseen large relative positions (at inference) to positions known from training, allowing LLMs to maintain coherence over longer texts. On both synthetic and real-world long-context tasks, Self-Extend achieves comparable or, surprisingly, better performance than many existing fine-tuning-based models.
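The position mapping can be illustrated with a small sketch. It assumes one particular scheme: nearby tokens keep their exact relative position, and distant tokens share positions in groups via floor division. The summary above only says that unseen positions are mapped to known ones, so the names `group_size` and `neighbor_window` and the exact formula are assumptions made for illustration.

```python
def remap_relative_position(distance: int, group_size: int, neighbor_window: int) -> int:
    """Map a relative distance seen at inference to a smaller, already-seen one.

    Illustrative only: the grouping scheme and parameter names are assumptions.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance < neighbor_window:
        # Nearby tokens keep their exact relative position.
        return distance
    # Distant tokens share one position per group of `group_size` distances.
    # The shift makes the two regimes meet at the window boundary when
    # neighbor_window is a multiple of group_size.
    shift = neighbor_window - neighbor_window // group_size
    return distance // group_size + shift


def largest_mapped_position(context_len: int, group_size: int, neighbor_window: int) -> int:
    """Largest relative position the model sees for a given input length."""
    return remap_relative_position(context_len - 1, group_size, neighbor_window)
```

Under these assumptions, an 8192-token input with `group_size=4` and `neighbor_window=1024` never produces a relative position above 2815, which would stay inside a 4096-token training window.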

2. Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon

In this work, the researchers introduce Activation Beacon to extend the context length of LLMs. Activation Beacon condenses the LLM’s raw activations into more compact forms, enabling the LLM to perceive a vast context within a limited context window. As a plug-and-play component, it brings in long contextual information while fully preserving the LLM’s existing capabilities on short contexts. For long-sequence data, it processes the input in a streaming fashion with sliding windows, which makes it efficient at both inference and training time. Because the condensing ratios are sampled diversely during training, it can learn to support a wide range of context lengths from short-sequence training data alone. The experiments show Activation Beacon to be an effective, efficient, compatible, and low-cost (in training) method for extending the context length of LLMs.

3. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

This paper introduces a novel fine-tuning method, Self-Play Fine-Tuning (SPIN), which converts a weak LLM into a strong one by unleashing the full power of human-annotated data. Central to the method is a self-play mechanism: a main player (the current LLM) is fine-tuned to distinguish the responses of an opponent player (the LLM from the previous iteration) from the target data distribution, so that the LLM is iteratively aligned with that distribution. SPIN thus facilitates the LLM’s iterative self-evaluation and enhancement through self-play. In contrast to supervised fine-tuning and RL fine-tuning methods, SPIN enables the LLM to self-improve without additional human data or feedback from stronger LLMs. Empirical results show that SPIN significantly improves LLM performance across diverse benchmarks, even outperforming models trained with additional human data or AI feedback.
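As a rough illustration of the self-play objective, the sketch below scores one training pair with a logistic loss over log-likelihood ratios between the current model and the previous iteration. The summary does not state the loss, so the logistic form, the `scale` parameter, and all argument names are assumptions.

```python
import math


def spin_loss(logp_new_human: float, logp_old_human: float,
              logp_new_synth: float, logp_old_synth: float,
              scale: float = 1.0) -> float:
    """Logistic loss for one (human response, self-generated response) pair.

    logp_new_* are log-probabilities under the model being trained (main player);
    logp_old_* are log-probabilities under the previous iteration (opponent).
    Illustrative sketch: the exact loss form is an assumption.
    """
    # The margin grows when the new model raises the likelihood of the human
    # response more than it raises the likelihood of the synthetic response.
    margin = scale * ((logp_new_human - logp_old_human)
                      - (logp_new_synth - logp_old_synth))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the new model is identical to the previous iteration the margin is zero and the loss equals log 2; the loss falls as the new model prefers human responses over its predecessor's generations.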

4. LLM Augmented LLMs: Expanding Capabilities through Composition

In this work, the researchers propose Composition to Augment Language Models (CALM), a framework that addresses the general model composition setting. Rather than combining the augmenting and anchor LMs shallowly, CALM introduces a small number of trainable parameters over the intermediate layer representations of both models. CALM finds an effective combination of the given models that performs new, challenging tasks more accurately than either model alone, while preserving the capabilities of the individual models. Salient features of CALM are that it (i) scales up LLMs on new tasks by “re-using” existing LLMs along with a few additional parameters and data, (ii) keeps existing model weights intact and hence preserves existing capabilities, and (iii) applies to diverse domains and settings. Augmenting PaLM2-S with a smaller model can yield an absolute improvement of up to 13% on tasks like translation and a relative improvement of 40% on code generation and explanation tasks.
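One way to picture composition over intermediate representations is a single cross-attention step in which the anchor model's hidden states attend to projected hidden states of the augmenting model, with both models frozen. The sketch below is a simplified illustration under that assumption: the function names are hypothetical, and `projection` stands in for the small set of trainable parameters.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def compose_layer(anchor_states, augment_states, projection):
    """Add augmenting-model information to anchor hidden states.

    anchor_states:  (n x d) hidden states from the frozen anchor model
    augment_states: (m x d_aug) hidden states from the frozen augmenting model
    projection:     (d_aug x d) trainable matrix (the only learned part here)
    """
    dim = len(anchor_states[0])
    # Project augmenting states into the anchor model's hidden width.
    projected = matmul(augment_states, projection)
    # Attention scores: each anchor position against each augmenting position.
    transposed = [list(col) for col in zip(*projected)]
    scores = matmul(anchor_states, transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    attended = matmul(weights, projected)
    # Residual connection: the anchor representation is kept and only added to.
    return [[a + c for a, c in zip(a_row, c_row)]
            for a_row, c_row in zip(anchor_states, attended)]
```

With an all-zero `projection` the function returns the anchor states unchanged, which mirrors the point that existing capabilities are preserved because the original weights are never modified.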

5. Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives

In this paper, comprehensive experiments reveal that the bottleneck behind poor reflection performance is the LLM’s inability to accurately evaluate its prior responses. This often manifests as overconfident or inconsistent feedback, which hinders the effectiveness of self-reflection. The authors therefore advocate Self-Contrast: the LLM creates multiple solving perspectives to obtain diverse results, mitigating the overconfidence bias of a single prompt. By contrasting the different perspectives, the LLM then summarizes more accurate checking instructions to resolve discrepancies and improve reflection. Empirically, compared with vanilla reflection, Self-Contrast shows significant improvements and greater stability in both mathematical reasoning and challenging translation scenarios.

6. TinyLlama: An Open-Source Small Language Model

In this paper, the researchers introduce TinyLlama, an open-source, small-scale language model. To promote transparency in the open-source LLM pre-training community, they have released all relevant information, including the pre-training code, all intermediate model checkpoints, and the details of the data processing steps. With its compact architecture and promising performance, TinyLlama can enable end-user applications on mobile devices and serve as a lightweight platform for testing a wide range of ideas related to language models. The authors plan to draw on the experience accumulated during the open, live phase of the project to develop improved versions of TinyLlama with a broader set of capabilities across tasks. Model checkpoints and code are publicly available on GitHub.

7. Mixtral of Experts

Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model, is formally introduced in this paper. Mixtral has the same architecture as Mistral 7B, except that each layer is composed of 8 feedforward blocks (i.e., experts). For every token, at each layer, a router network selects two experts to process the current state and combines their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters but uses only 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. The authors also provide a model fine-tuned to follow instructions, Mixtral 8x7B – Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and the Llama 2 70B – chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license, and the code is publicly available.
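To make the routing step described above concrete, here is a minimal sketch written for this digest; it is not the authors' released code. It assumes the two selected experts are weighted by a softmax over their router logits, and it uses toy experts in place of feedforward blocks. The function and variable names are hypothetical.

```python
import math


def top2_route(router_logits, experts, x):
    """Send one token vector `x` through the two highest-scoring experts.

    router_logits: one score per expert for this token
    experts:       list of callables, each mapping a vector to a vector
    Returns the combined output and the indices of the chosen experts.
    """
    if len(router_logits) != len(experts):
        raise ValueError("need exactly one router logit per expert")
    # Pick the two experts with the largest router scores.
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    # Softmax over the chosen logits only (assumed weighting scheme).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Only the chosen experts are evaluated; the other six do no work.
    outputs = [experts[i](x) for i in chosen]
    combined = [sum(w * out[d] for w, out in zip(weights, outputs))
                for d in range(len(x))]
    return combined, chosen


# Eight toy experts: expert k simply scales its input by k + 1.
toy_experts = [lambda v, s=s: [s * a for a in v] for s in range(1, 9)]
```

Because only two of the eight experts run for each token, the compute per token scales with the active parameters rather than with the total parameter count.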

8. Agent AI: Surveying the Horizons of Multimodal Interaction

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for creating embodied agents. Embedding agents within such environments helps models process and interpret visual and contextual data, which is critical for creating more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can use that information to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, the researchers define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, they explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. They argue that developing agentic AI systems in grounded environments can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, they envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

9. MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

In this paper, the researchers introduce MoE-Mamba, a model that combines Mamba with a Mixture of Experts layer. MoE-Mamba obtains the efficiency gains of both SSMs and MoE. The authors also show that MoE-Mamba behaves predictably as the number of experts varies. In their experiments, MoE-Mamba requires 2.2x fewer training steps to reach the same performance as Mamba and shows potential gains over Transformer and Transformer-MoE. These preliminary results point to a promising research direction that may allow scaling SSMs to tens of billions of parameters.

10. GPT-4V(ision) is a Generalist Web Agent, if Grounded

In this paper, the researchers propose SeeAct, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. SeeAct with GPT-4V is a strong generalist web agent if oracle grounding is provided. In online evaluation, it successfully completes 50% of tasks on different websites, substantially outperforming existing methods such as GPT-4 (20%) and FLAN-T5 (18%). This demonstrates the potential of LMMs like GPT-4V for web agents. However, grounding remains a major challenge: the best grounding strategy still has a 20-25% gap with oracle grounding. Among the various grounding strategies, the best one leverages both HTML text and visuals, outperforming image annotation strategies by up to 30%. In-context learning with large models (both LMMs and LLMs) shows better generalization to unseen websites, while supervised fine-tuning still has an edge on websites seen during training. There is a non-negligible discrepancy between online and offline evaluation because there are often multiple viable plans for completing the same task. The project is publicly available.

*The researchers behind the publications deserve full credit for their work.

