
AI Research Highlights | Week 1, 2024

Updated: Feb 19

1. WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation

In this paper, researchers proposed a method that makes full use of source code and explicitly controls the quality of generated data. Because instruction tuning aligns a pre-trained model with an instruction-following training set, they presented an LLM Generator-Discriminator framework for instruction data generation. By combining generation and discrimination, the method makes the data generation process more customizable and controllable. Taking raw code as input and selecting a core dataset, the method can stably generate more realistic instruction data and control its diversity by adjusting the distribution of the raw code. The focus is on enhancing the performance of code LLMs through instruction tuning. To that end, they refined the instruction data by classifying instruction instances into four universal code-related tasks: Code Summarization, Code Generation, Code Translation, and Code Repair. Using this data generation strategy, they produced CodeOcean, a dataset of 20,000 instruction instances covering the four tasks. To validate the approach, the researchers also introduced WaveCoder models, fine-tuned on this data, and evaluated them on the HumanEval, MBPP, and HumanEvalPack benchmarks. Experimental results showed that WaveCoder performs strongly even with small-scale instruction tuning.

2. Supervised Knowledge Makes Large Language Models Better In-context Learners

This paper introduced SuperContext, a versatile and straightforward in-context learning strategy that harnesses the strength of small language models (SLMs) to augment LLMs, focusing in particular on out-of-distribution (OOD) generalization and factuality. At the heart of SuperContext is the integration of SLM outputs, which represent supervised knowledge, into LLM prompts; for example, the predictions and confidence of a discriminative model are incorporated during the LLM’s inference stage. The idea is similar in spirit to existing work on retrieving information from external knowledge bases or API tools, such as unstructured corpora, structured databases, Wikipedia, and Google API. SuperContext is validated on a comprehensive OOD benchmark, GLUE-X, and on a QA dataset, SQuAD 2.0. Empirical results showed that the method significantly outperforms both LLMs and SLMs in zero-shot and few-shot settings on 9 distinct tasks under the OOD setting considered. The work presents SuperContext as a pioneering approach to systematically integrating SLMs into LLM inference decisions, significantly enhancing LLM performance, especially in handling OOD data and mitigating hallucinations, and thereby contributing to more generalizable and factual deployment of LLMs.
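
The core mechanism, placing a small model's prediction and confidence into the LLM prompt, can be sketched as follows. This is a minimal illustration, not the paper's prompt template: `SLMPrediction`, `build_supercontext_prompt`, and the prompt wording are all hypothetical names and text chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SLMPrediction:
    """Output of a small discriminative model (hypothetical container)."""
    label: str
    confidence: float  # probability assigned to `label`, in [0, 1]


def build_supercontext_prompt(task: str, text: str, pred: SLMPrediction) -> str:
    """Build an LLM prompt that includes the small model's label and confidence.

    The wording is an assumption made for illustration only.
    """
    return (
        f"Task: {task}\n"
        f"Input: {text}\n"
        f"Small-model prediction: {pred.label} "
        f"(confidence {pred.confidence:.2f})\n"
        "Use the prediction as a reference, but answer with your own judgment.\n"
        "Answer:"
    )


prompt = build_supercontext_prompt(
    "Classify the sentiment as positive or negative.",
    "The plot was thin, but the acting carried the film.",
    SLMPrediction(label="positive", confidence=0.71),
)
print(prompt)
```

The resulting string would then be sent to the LLM in place of the plain task prompt, so the LLM can weigh the supervised signal against its own reading of the input.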

3. Do Androids Know They're Only Dreaming of Electric Sheep?

In this paper, researchers produced a high-quality dataset of more than 15k utterances with hallucination annotations for organic and synthetic output texts across three grounded generation tasks. They proposed three probe architectures for detecting hallucinations and demonstrated improvements over multiple contemporary baselines in hallucination detection. They also analyzed how probe accuracy is affected by annotation type (synthetic/organic), hallucination type (extrinsic/intrinsic), model size, and which part of the encoding is probed.

4. SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling

Upstage presented depth up-scaling (DUS), an effective and efficient method for up-scaling LLMs that remains straightforward to use. DUS consists of scaling the base model along the depth dimension and continually pretraining the scaled model. DUS does not scale the model using MoE; instead, it uses a depthwise scaling method analogous to that of Tan and Le (2019), adapted for the LLM architecture. As a result, there are no additional modules or dynamism as with MoE, which makes DUS immediately compatible with easy-to-use LLM frameworks such as HuggingFace, with no changes to the training or inference framework, for maximal efficiency. Furthermore, DUS is applicable to all transformer architectures, opening up new ways to scale up LLMs effectively and efficiently in a simple manner. Using DUS, they released SOLAR 10.7B, an LLM with 10.7 billion parameters that outperforms existing models like Llama 2 and Mistral 7B on various benchmarks.
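
The summary above does not spell out the layer arithmetic behind depthwise scaling. The sketch below assumes a duplicate, trim, and concatenate scheme with a 32-layer base and 8 layers trimmed from each copy, giving 48 layers; treat both the scheme and the numbers as assumptions of this example. The function name `depth_up_scale` is hypothetical, and strings stand in for real transformer layers.

```python
from typing import List, TypeVar

T = TypeVar("T")


def depth_up_scale(layers: List[T], m: int) -> List[T]:
    """Return a deeper stack built from two copies of `layers`.

    The first copy drops its last `m` layers, the second copy drops its
    first `m` layers, and the two are concatenated. In a real model each
    layer in the second copy would be a deep copy with its own weights;
    here the elements are simply reused for illustration.
    """
    n = len(layers)
    if not 0 <= m < n:
        raise ValueError("m must satisfy 0 <= m < len(layers)")
    first_copy = layers[: n - m]   # layers 0 .. n-m-1
    second_copy = layers[m:]       # layers m .. n-1
    return first_copy + second_copy


base = [f"layer_{i}" for i in range(32)]   # assumed 32-layer base model
scaled = depth_up_scale(base, m=8)         # assumed trim of 8 layers per copy
print(len(scaled))
```

Under this scheme the scaled stack has 2 * (n - m) layers. The scaled model would then be continually pretrained, as the summary describes, to recover from the seam where the two copies meet.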

5. Empowering Working Memory for Large Language Model Agents

Traditional LLM agent models lack episodic memory depth and continuity across interaction domains. To address this, an enhanced model is proposed incorporating a centralized Working Memory Hub along with Episodic Buffer access. This equips agents with greater contextual memory during complex sequential tasks and collaborative engagements. The innovative model provides a strategic blueprint for developing LLM agents with more robust and human-like memory capabilities. Further advancements in memory encoding, consolidation, and retrieval mechanisms are imperative to fully actualize these ambitions.

6. Evolving Large Language Model Assistant with Long-Term Conditional Memory

In this paper, researchers proposed using long-term conditional memory to build an easy-access, evolving Large Language Model assistant that can learn experience or knowledge during its dialogue with the human user. To investigate the model in this scenario, they focused on the influence of different forms of memory records and on how to use retrieved memory records effectively when generating a response. They also proposed a novel approach to constructing conditional memory. Moreover, given the lack of datasets for evaluating the framework, the researchers built three test datasets based on the different abilities required of an evolving LLM assistant. The experiments further showed the effectiveness of the proposed methods.

7. Improving Text Embeddings with Large Language Models

In this paper, researchers from Microsoft proposed a novel method for text embeddings that leverages LLMs to overcome the limitations of existing approaches. They used proprietary LLMs to generate synthetic data for a diverse range of text embedding tasks in 93 languages, covering hundreds of thousands of embedding tasks. Specifically, they used a two-step prompting strategy that first prompts the LLMs to brainstorm a pool of candidate tasks, and then prompts the LLMs to generate data conditioned on a given task from the pool. To cover various application scenarios, they designed multiple prompt templates for each task type and combined the generated data from different templates to boost diversity. For the text embedding models, they opted to fine-tune powerful open-source LLMs rather than small BERT-style models. When fine-tuned on a mixture of synthetic and labeled data, the model achieved new state-of-the-art results, surpassing previous methods by a significant margin (+2%). The entire training process requires fewer than 1k steps.
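
The two-step prompting strategy can be sketched as two calls to a text-generation function. Everything named here is an assumption made for illustration: `brainstorm_tasks`, `generate_example`, the prompt wording, and the `fake_llm` stub that stands in for a real LLM call. The paper's actual templates are not reproduced.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # any function mapping a prompt to generated text


def brainstorm_tasks(llm: LLM, n: int) -> List[str]:
    """Step 1: ask the LLM for a pool of candidate embedding tasks."""
    reply = llm(f"Brainstorm {n} distinct text retrieval tasks, one per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def generate_example(llm: LLM, task: str) -> str:
    """Step 2: ask the LLM for training data conditioned on one task."""
    return llm(
        f"Task: {task}\n"
        "Write one user query and one relevant passage for this task."
    )


def fake_llm(prompt: str) -> str:
    """Stub used only so the sketch runs without a model."""
    if prompt.startswith("Brainstorm"):
        return "Find papers that answer a science question\nRetrieve recipes for a dish\n"
    return "query: <generated query> | passage: <generated passage>"


tasks = brainstorm_tasks(fake_llm, n=2)
synthetic_data = [(task, generate_example(fake_llm, task)) for task in tasks]
for task, example in synthetic_data:
    print(task, "->", example)
```

Swapping `fake_llm` for a real client and varying the step-2 template per task type would correspond to the diversity measure described above.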

8. Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

In this paper, the authors modified the Chinchilla scaling laws to account for inference costs, calculating the optimal parameter and training token counts (in terms of both compute and dollar costs) needed to train and deploy a model of any given quality and inference demand. Their principled derivation estimates that LLM practitioners expecting significant demand (~10^9 inference requests) should train models that are substantially smaller, and trained for longer, than Chinchilla-optimal.
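
The intuition can be illustrated with the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per inference token. The sketch below is a back-of-the-envelope comparison under those assumptions. The two model configurations and the inference volume are invented for illustration and are not claimed to reach equal quality; the paper's derivation solves for that constraint, which this sketch does not.

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Approximate total compute over a model's life.

    Assumes ~6*N*D FLOPs for training and ~2*N*D FLOPs for inference,
    where N is the parameter count and D the number of tokens processed.
    """
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * inference_tokens
    return training + inference


# Illustrative numbers only (hypothetical configurations).
inference_tokens = 1e13
larger = lifetime_flops(n_params=70e9, train_tokens=1.4e12, inference_tokens=inference_tokens)
smaller = lifetime_flops(n_params=35e9, train_tokens=4.0e12, inference_tokens=inference_tokens)

print(f"larger model, shorter training: {larger:.3e} FLOPs")
print(f"smaller model, longer training: {smaller:.3e} FLOPs")
```

With this much inference traffic, the smaller model's extra training compute is more than repaid by cheaper inference, which is the trade-off the modified scaling laws formalize.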

9. Fast Inference of Mixture-of-Experts Language Models with Offloading

In this work, researchers systematically developed techniques for running large MoE language models with limited GPU memory. They observed how an MoE language model accesses its experts between tokens and found several regularities: i) some experts are reused between adjacent tokens, and ii) the hidden states of early layers already “know” which experts will be used at subsequent layers. They designed an MoE-specific offloading strategy that takes advantage of these regularities: i) it uses an LRU cache to significantly reduce GPU-RAM communication, leading to faster generation, and ii) it guesses which experts are needed ahead of time to better overlap expert loading with computation. They considered the specific scenario of running Mixtral-8x7B-Instruct on a T4, RTX 3060, and RTX 3080 Mobile, and developed a practical combination of mixed quantization and the proposed offloading algorithm to run this model interactively at 2-3 tokens per second, depending on the hardware. The project can be found here.
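
The first regularity, expert reuse between adjacent tokens, is what makes an LRU cache effective. A minimal sketch of such a cache follows. `ExpertLRUCache` and `load_fn` are hypothetical names, and the loader here returns a string, whereas a real system would copy expert weights from host RAM to the GPU. This is not the project's implementation.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class ExpertLRUCache:
    """Keep the most recently used experts resident; evict the least recent."""

    def __init__(self, capacity: int, load_fn: Callable[[Hashable], Any]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn  # called on a miss (stands in for a RAM-to-GPU copy)
        self._cache: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> Any:
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return weights


cache = ExpertLRUCache(capacity=2, load_fn=lambda eid: f"weights[{eid}]")
for expert_id in [0, 1, 0, 2, 0, 1]:  # toy sequence of expert activations
    cache.get(expert_id)
print(cache.hits, cache.misses)  # prints "2 4"
```

Each hit avoids one transfer. The second regularity, speculative loading, would add a prefetch step that calls `get` for predicted experts before the layer that needs them runs.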

*The researchers behind the publications deserve full credit for their work.

