
AI Research Highlights | Week 52, 2023

Updated: Feb 20

1. LLM in a flash: Efficient Large Language Model Inference with Limited Memory

This paper addresses the challenge of efficiently running large language models (LLMs) that exceed the available DRAM capacity. The approach stores the model parameters in flash memory and loads them into DRAM on demand. It introduces two main techniques. "Windowing" reduces data transfer by reusing neurons activated for recent tokens. "Row-column bundling" increases the size of the data chunks read from flash memory. Together, these methods make it possible to run models up to twice the size of the available DRAM. Inference speed improves by 4-5x on CPU and 20-25x on GPU compared to naive loading approaches. The paper also analyzes the characteristics of flash and DRAM storage, including bandwidth and energy constraints and read throughput, and uses this analysis to guide how data is managed in DRAM.
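To make the windowing idea concrete, here is a minimal Python sketch. It assumes a predictor has already decided which feed-forward neurons each token needs. The class `NeuronWindowCache` and the helper `bundle_rows_and_columns` are hypothetical names, not the paper's code. The cache only tracks which neurons would have to be read from flash; it leaves out the actual I/O and DRAM eviction. The bundling helper shows only the data layout, under an assumed `(d_model, d_ff)` / `(d_ff, d_model)` weight convention.

```python
from collections import deque

import numpy as np


class NeuronWindowCache:
    """Hypothetical sketch of "windowing": neurons used by any of the last
    `window` tokens stay in DRAM, so only newly needed neurons are read from flash."""

    def __init__(self, window: int):
        self.window = window
        self.history = deque()  # one set of active neuron ids per recent token

    def resident(self) -> set:
        # Neurons currently kept in DRAM: the union over the sliding window.
        return set().union(*self.history)

    def step(self, active: set) -> set:
        """Register the neurons predicted active for a new token and return
        the ids that are not yet resident and must be loaded from flash."""
        to_load = set(active) - self.resident()
        self.history.append(set(active))
        if len(self.history) > self.window:
            self.history.popleft()  # neurons used only by the oldest token become evictable
        return to_load


def bundle_rows_and_columns(w_up: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """Row-column bundling: place column i of the up projection next to row i
    of the down projection so that one contiguous read fetches both.

    Assumed shapes: w_up is (d_model, d_ff), w_down is (d_ff, d_model).
    The result is (d_ff, 2 * d_model); row i is the bundle for FFN neuron i.
    """
    return np.concatenate([w_up.T, w_down], axis=1)
```

Take `window=2` as an example. If one token uses neurons {1, 2, 3} and the next token needs {2, 3, 4}, the second token triggers a flash read only for neuron 4. That reuse is where the transfer savings come from. Bundling doubles the size of each contiguous read, which is what improves flash throughput.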

2. Mini-GPTs: Efficient Large Language Models through Contextual Pruning

This paper examines the use of contextual pruning to create Mini-GPTs, smaller yet efficient versions of existing LLMs. The authors analyze and remove weights that are less critical for a specific domain, such as law, healthcare, or finance. Their aim is to maintain or even improve model performance while significantly reducing model size and resource usage. The results underscore the efficiency and effectiveness of contextual pruning, not merely as a theoretical concept but as a practical tool for developing domain-specific, resource-efficient LLMs. The authors conclude that contextual pruning is a promising method for building domain-specific LLMs. They present this research as a building block for future work with more compute, refined fine-tuning, and quantization.
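Here is a rough illustration of contextual pruning. It assumes the simplest possible importance score: the mean absolute activation of each neuron on domain-specific text. The paper's actual scoring and thresholds may differ, and the function names are hypothetical. The sketch uses NumPy and operates on plain weight arrays rather than a real model.

```python
import numpy as np


def contextual_prune_mask(activations: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean mask of neurons to KEEP.

    activations: (num_samples, num_neurons) outputs of one layer, collected
    on text from the target domain (e.g. legal documents).
    Assumed criterion: keep a neuron if its mean absolute activation on that
    domain exceeds `threshold`.
    """
    importance = np.abs(activations).mean(axis=0)
    return importance > threshold


def prune_linear(weight: np.ndarray, bias: np.ndarray, keep: np.ndarray):
    """Drop output neurons of a linear layer whose weight is (out, in)."""
    return weight[keep], bias[keep]


def prune_next_layer_inputs(next_weight: np.ndarray, keep: np.ndarray) -> np.ndarray:
    """Drop the matching input columns of the following (out, in) layer."""
    return next_weight[:, keep]
```

Removing an output neuron from one layer also removes an input to the next layer. For that reason, both layers are sliced with the same mask.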

3. The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

This paper presents a surprising finding: carefully pruning specific layers of Transformer models can significantly boost performance on some tasks. The researchers describe LAyer SElective Rank reduction (LASER), an intervention that removes the higher-order components of learned weight matrices, as identified by singular value decomposition (SVD). The reduction is applied to specific weight matrices and layers of the Transformer model. The results show that many such matrices can be substantially reduced. Performance degradation is often not observed until well over 90% of the components have been removed. Moreover, the effect does not appear to be limited to natural language: performance gains were also found in reinforcement learning.
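The core operation in LASER is a truncated SVD of a single weight matrix. The NumPy sketch below keeps only the largest singular components and discards the rest. `keep_fraction` is an illustrative parameter name. The paper chooses which layer and matrix to reduce per task; the sketch leaves that choice to the caller.

```python
import numpy as np


def rank_reduce(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Low-rank approximation of `weight` via truncated SVD.

    Keeps the top `keep_fraction` of singular components (at least one) and
    drops the higher-order components, i.e. those with the smallest singular values.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = max(1, int(round(keep_fraction * len(s))))
    # Rebuild the matrix from the k largest components only.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


w = np.diag([5.0, 3.0, 1.0, 0.5])
reduced = rank_reduce(w, keep_fraction=0.5)  # keeps the components with singular values 5 and 3
```

In the paper's terms, removing "well over 90% of components" corresponds to a `keep_fraction` below 0.1.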

4. Retrieval-Augmented Generation for Large Language Models: A Survey

This paper thoroughly explores Retrieval-Augmented Generation (RAG), a technique that uses an external knowledge base to supplement the context of LLMs when generating responses. It organizes the development of RAG into three paradigms: Naive RAG, Advanced RAG, and Modular RAG. Each has its own models, methods, and shortcomings. Naive RAG primarily follows a "retrieval-reading" process. Advanced RAG adds more refined data processing, optimizes knowledge-base indexing, and introduces multiple or iterative retrievals. As RAG integrates other techniques such as fine-tuning, the Modular RAG paradigm emerges, enriching the process with new modules and offering more flexibility. The survey then summarizes the three main components of RAG and the key technologies in each: the retriever, the generator, and augmentation methods. Finally, it discusses how to evaluate RAG models. It introduces two evaluation methods, highlights key metrics and abilities for evaluation, and presents the latest automatic evaluation framework. The resources can be found at:
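The following is a toy version of the Naive RAG "retrieval-reading" loop. It assumes a bag-of-words cosine-similarity retriever in place of the dense embedding retrievers that real systems typically use. The function names are hypothetical. The generator step, the LLM call that would consume the prompt, is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Retrieval step: rank documents by similarity to the query.
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    # "Reading" step input: retrieved passages are placed in the LLM's context.
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using the context below.\nContext:\n{context}\nQuestion: {query}\nAnswer:"


docs = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "France borders Spain and Germany.",
]
print(build_prompt("What is the capital of France?", docs, k=1))
```

Advanced RAG, as summarized above, would refine this pipeline with better indexing or repeated retrieval rounds. The basic shape, retrieve and then prompt, stays the same.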

5. A Survey of Reasoning with Foundation Models

In this paper, the researchers introduced seminal foundation models that were proposed for, or can be adapted to, reasoning. They highlighted the latest advancements in various reasoning tasks, methods, and benchmarks. They then explored potential future directions behind the emergence of reasoning abilities in foundation models. They also discussed the relevance of multimodal learning, autonomous agents, and superalignment in the context of reasoning. In addition, they maintain a continuously updated reading list of relevant papers and popular reasoning benchmarks to support future research. GitHub:

6. Learning to Act without Actions

This work introduced Latent Action Policies from Observation (LAPO), a method for training policies over a learned latent action space inferred purely from observational data. Unlike prior work on imitation learning from observation, LAPO does not rely on access to the true action space or on a predefined set of discrete latent actions to learn a useful pretrained policy. Instead, LAPO learns a latent action space end-to-end. It does so by optimizing an unsupervised objective based on predictive consistency between an inverse dynamics model and a forward dynamics model. Vector quantization of the continuous latent actions induces an information bottleneck. This bottleneck forces the quantized actions to capture state-invariant transition information. The method was tested on all 16 games of the challenging Procgen Benchmark. The results showed that it can learn latent action spaces that reflect the structure of the true action spaces, even though LAPO never has access to them. The researchers then demonstrated that latent action policies, obtained through behavior cloning of latent actions, can serve as useful pretrained models. These models can be rapidly adapted to recover or even exceed the performance of the expert that generated the original action-free dataset.
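Below is a minimal NumPy sketch of the two ingredients named above. The first is the vector-quantization bottleneck. The second is a forward-prediction loss that ties the inverse and forward dynamics models together. The `idm` and `fdm` callables, the codebook, and the function names are assumptions for illustration. A real implementation would use neural networks, a straight-through gradient estimator, and the usual VQ codebook and commitment losses; none of these appear here.

```python
import numpy as np


def vector_quantize(z: np.ndarray, codebook: np.ndarray):
    """Snap each continuous latent action to its nearest codebook vector.

    z: (batch, dim) latents from the inverse dynamics model.
    codebook: (num_codes, dim) code vectors (learned during real training).
    Returns the chosen code indices and the quantized latents.
    """
    # Squared Euclidean distance from every latent to every code vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]


def predictive_consistency_loss(idm, fdm, codebook, obs_t, obs_next) -> float:
    """Hypothetical LAPO-style objective for one batch.

    idm(obs_t, obs_next) -> continuous latent action (inverse dynamics model)
    fdm(obs_t, action)   -> predicted next observation (forward dynamics model)
    """
    z = idm(obs_t, obs_next)               # infer a latent "action" explaining the transition
    _, z_q = vector_quantize(z, codebook)  # bottleneck: restrict it to a finite codebook
    pred = fdm(obs_t, z_q)                 # reconstruct obs_next from obs_t and the code alone
    return float(((pred - obs_next) ** 2).mean())
```

Because the forward model sees only the current observation and a quantized code, the code has to carry the information about what changed. This is the intuition behind the learned latent actions mirroring the true action space.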

7. Challenges with unsupervised LLM knowledge discovery

DeepMind scientists concluded in this paper that existing unsupervised methods for discovering latent knowledge are insufficient in practice. They also contributed sanity checks for evaluating future knowledge-elicitation methods. The conclusions may generalise to more sophisticated methods, though the exact experimental results might not. The authors believe that unsupervised approaches to discovering latent knowledge that rely on some consistency structure of knowledge are likely to suffer from similar issues. Even more sophisticated methods that search for properties associated with a model’s knowledge seem likely to encounter false positives, such as “simulations” of other entities’ knowledge.
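The paragraph above does not name a specific method. One well-known unsupervised approach built on a consistency structure is Contrast-Consistent Search (CCS; Burns et al., 2022), which this paper examines. The sketch below implements the standard CCS objective over probe probabilities for contrast pairs, where each pair is a statement and its negation. It also illustrates the concern raised above: a probe that tracks an unrelated binary feature scores exactly as well as one that tracks truth. The input arrays are made-up examples, not data from the paper.

```python
import numpy as np


def ccs_loss(p_pos: np.ndarray, p_neg: np.ndarray) -> float:
    """CCS objective (Burns et al., 2022) averaged over contrast pairs.

    p_pos[i]: probe's probability that statement i is true.
    p_neg[i]: probe's probability that the negation of statement i is true.
    """
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2   # penalizes hedged answers such as 0.5 / 0.5
    return float(np.mean(consistency + confidence))


# Probe A follows the true labels; probe B follows some unrelated binary
# feature. Both are perfectly consistent and confident, so both score 0.
truth = np.array([1.0, 0.0, 1.0, 0.0])
other_feature = np.array([0.0, 0.0, 1.0, 1.0])
print(ccs_loss(truth, 1.0 - truth))                  # 0.0
print(ccs_loss(other_feature, 1.0 - other_feature))  # 0.0
print(ccs_loss(np.full(4, 0.5), np.full(4, 0.5)))    # 0.25
```

Since the objective cannot distinguish the two probes, minimizing it alone gives no guarantee that the discovered direction corresponds to the model's knowledge.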

8. AppAgent: Multimodal Agents as Smartphone Users

Researchers from Tencent open-sourced a multimodal agent framework for operating smartphone applications through a purpose-built action space. They proposed a novel exploration strategy that enables the agent to learn to use new apps. Through extensive experiments across multiple apps, they validated the advantages of the framework, demonstrating its potential for AI-assisted smartphone app operation. The project can be found at:
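To illustrate what an agent "action space" means here, the following hypothetical Python sketch parses a model's text output into a small set of allowed smartphone actions. On-screen elements are referred to by numeric labels. The action names, argument counts, and syntax are assumptions loosely modeled on typical touch interactions, not AppAgent's actual interface.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "tap", "long_press", "swipe", "text", "back"
    args: tuple  # string arguments, e.g. ("5",) for the element labeled 5


# Hypothetical grammar: name(arg1, arg2, ...). Maps each allowed action to its arg count.
ALLOWED = {"tap": 1, "long_press": 1, "swipe": 3, "text": 1, "back": 0}
_ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_action(output: str) -> Action:
    """Turn one line of model output into a validated Action, or raise ValueError."""
    m = _ACTION_RE.match(output)
    if not m:
        raise ValueError(f"unparseable action: {output!r}")
    kind, raw = m.group(1), m.group(2).strip()
    # Naive split: does not handle commas inside quoted text.
    args = tuple(a.strip().strip('"') for a in raw.split(",")) if raw else ()
    if kind not in ALLOWED or len(args) != ALLOWED[kind]:
        raise ValueError(f"invalid action: {output!r}")
    return Action(kind, args)


print(parse_action('swipe(3, "up", "medium")'))
```

Rejecting anything outside the fixed grammar keeps the LLM's free-form output within operations that a device controller can execute.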

9. Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning

Researchers from Huawei presented the Pangu-Agent framework to facilitate future research toward developing generalist AI agents. Pangu-Agent builds on LLMs to address reasoning and decision-making problems, which allows it to draw on human priors. The researchers proposed a general RL-based objective to optimise the agent’s intrinsic and extrinsic functions. They implemented several intrinsic functions and showcased the modular functionality of their agent and how it can support recent advances in LLM research. They extensively evaluated Pangu-Agent on several single-agent and multi-agent tasks, using different prompt-engineering methods across various LLMs. They also offered insights into the relative advantages and disadvantages of these methods. Moreover, they discussed how the framework can fine-tune LLMs through a supervised fine-tuning (SFT) and RL pipeline. The results indicated that fine-tuning can improve the agent’s performance by up to threefold in specific multi-step decision-making domains such as ALFWorld and BabyAI.
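The following hypothetical sketch illustrates the intrinsic/extrinsic split. Intrinsic functions only transform the agent's internal memory, for example through a "think" step. The extrinsic function maps that memory to an action in the environment. The `Memory` type and the toy `think` and `act` functions are assumptions that stand in for LLM calls; this is not Pangu-Agent's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    observation: str
    thoughts: List[str] = field(default_factory=list)


IntrinsicFn = Callable[[Memory], Memory]  # updates the agent's internal memory only
ExtrinsicFn = Callable[[Memory], str]     # turns memory into an environment action


def think(mem: Memory) -> Memory:
    # Hypothetical intrinsic function; in a real agent this would prompt an LLM.
    return Memory(mem.observation, mem.thoughts + [f"noted: {mem.observation}"])


def act(mem: Memory) -> str:
    # Hypothetical extrinsic function with a toy rule in place of an LLM policy.
    return "open drawer" if "drawer" in mem.observation else "look around"


def agent_step(observation: str, intrinsics: List[IntrinsicFn],
               extrinsic: ExtrinsicFn) -> Tuple[str, Memory]:
    """Run the intrinsic functions in order, then the extrinsic one."""
    mem = Memory(observation)
    for fn in intrinsics:
        mem = fn(mem)
    return extrinsic(mem), mem


action, mem = agent_step("a closed drawer", [think, think], act)
print(action)             # open drawer
print(len(mem.thoughts))  # 2
```

Because intrinsic functions share one signature, they can be added, removed, or reordered without changing the extrinsic policy. This is the kind of modularity the paragraph describes.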

*The researchers behind the publications deserve full credit for their work.

