AI vs. Human: Can Large Language Models Outperform Portfolio Managers?

As the Gen AI/LLM space continues to heat up, I decided to put several Large Language Models (LLMs) to the test by presenting them with a challenging problem statement: creating millions of portfolio combinations from thousands of equities, backtesting them for efficiency, and deploying the top-performing ones. This task demands advanced reasoning and inference capabilities, making it a strong test case for LLMs.

The Problem Statement

Portfolio optimization is a classic problem in finance, where the goal is to create a diversified portfolio that maximizes returns while minimizing risk. With thousands of equities to choose from, the number of possible portfolio combinations is staggering, making it a computationally intensive task. I posed the following question to the LLMs (simplified here; not the exact prompt):

“Design a system to generate millions of portfolio combinations from thousands of equities, backtest them for efficiency, and deploy the top-performing ones.”
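
To make the computational scale concrete before looking at the models' answers, here is a minimal, hypothetical sketch of the brute-force core of that prompt in Python. The universe size, synthetic return data, and Sharpe-ratio scoring are all my own illustrative assumptions; none of this comes from any model's output.

```python
# Sample random portfolios from an equity universe, "backtest" each on
# historical daily returns, and keep the top performers by Sharpe ratio.
import numpy as np

rng = np.random.default_rng(42)

n_equities = 1000          # universe size (thousands in the real problem)
n_days = 252               # one year of daily returns
n_portfolios = 100_000     # millions in the real problem
portfolio_size = 20        # equities held per portfolio

# Stand-in for real historical data: random daily returns, ~2% daily vol.
returns = rng.normal(0.0005, 0.02, size=(n_days, n_equities))

def sharpe(portfolio_returns, risk_free=0.0):
    """Annualized Sharpe ratio of a daily return series."""
    excess = portfolio_returns - risk_free
    return np.sqrt(252) * excess.mean() / excess.std()

results = []
for _ in range(n_portfolios):
    picks = rng.choice(n_equities, size=portfolio_size, replace=False)
    weights = rng.dirichlet(np.ones(portfolio_size))   # weights sum to 1
    daily = returns[:, picks] @ weights                # portfolio daily returns
    results.append((sharpe(daily), picks, weights))

# "Deploy" the top performers: here, just report the ten best Sharpe ratios.
top = sorted(results, key=lambda r: r[0], reverse=True)[:10]
for rank, (s, picks, _) in enumerate(top, 1):
    print(f"#{rank}: Sharpe {s:.2f}, holdings {picks[:5]}...")
```

Even at this toy scale, the sampling loop dominates; the real problem layers data pipelines, realistic cost models, and deployment on top of it.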

The Models

I tested seven LLMs through HuggingChat (LM Studio would work too), each with its own strengths and weaknesses. The models were:

  1. CohereForAI/c4ai-command-r-plus
  2. meta-llama/Meta-Llama-3-70B-Instruct
  3. HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
  4. mistralai/Mixtral-8x7B-Instruct-v0.1
  5. google/gemma-1.1-7b-it
  6. NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO
  7. mistralai/Mistral-7B-Instruct-v0.2

The Results

The most impressive outputs came from Llama 3 and Cohere Command R+, each of which provided a comprehensive design outline and project schedule. Llama 3’s output included:

  • Reasonably good design inputs covering Data Ingestion & Preprocessing, GNN Model Architecture (with both Graph Convolutional Network and Graph Attention Network variants), Portfolio Generation & Backtesting, and Ranking & Recommendations
  • Solid scaffolding code in TensorFlow (a minimal illustrative sketch of one such building block follows below)
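
Llama 3’s actual scaffolding is not reproduced here; instead, the following is a minimal sketch, assuming a plain Keras implementation, of the graph-convolution building block such a design calls for. The layer definition, toy features, and self-loop-only adjacency matrix are illustrative assumptions, not the model’s code.

```python
import numpy as np
import tensorflow as tf

class GraphConvolution(tf.keras.layers.Layer):
    """One GCN layer: H' = activation(A_hat @ H @ W), where A_hat is a
    normalized adjacency matrix (e.g. an equity relationship graph)."""

    def __init__(self, units, activation="relu"):
        super().__init__()
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # input_shape is a list: [features_shape, adjacency_shape]
        feat_dim = input_shape[0][-1]
        self.w = self.add_weight(
            shape=(feat_dim, self.units), initializer="glorot_uniform"
        )

    def call(self, inputs):
        features, a_hat = inputs
        return self.activation(a_hat @ features @ self.w)

# Toy example: 5 equities, 3 features each (e.g. momentum, vol, sector code),
# with a self-loop-only, row-normalized adjacency matrix for demonstration.
n, f = 5, 3
features = tf.random.normal((n, f))
adj = np.eye(n, dtype="float32")
a_hat = tf.constant(adj / adj.sum(axis=1, keepdims=True))

layer = GraphConvolution(units=8)
embeddings = layer([features, a_hat])      # shape (5, 8) node embeddings
print(embeddings.shape)
```

In a real system, a_hat would be derived from an actual equity graph, such as a thresholded correlation matrix, rather than an identity matrix.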

Cohere Command R+ covered the basics and provided additional inputs around:

  • Audit Review/Reporting
  • Risk Analysis
  • Regulatory Compliance (Explainability, Ethical Considerations)
  • Documentation

Cohere Command R+ also produced more practical code, making use of packages like StellarGraph.
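
For context, here is a hypothetical sketch of that kind of StellarGraph usage, reconstructed from the library’s documented API rather than from the model’s verbatim output; the tickers, features, and correlation edges are illustrative assumptions. (Note that StellarGraph targets older TensorFlow 2.x releases.)

```python
import pandas as pd
from stellargraph import StellarGraph
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN

# Nodes: equities with toy features; edges: e.g. correlation links.
nodes = pd.DataFrame(
    {"momentum": [0.1, -0.2, 0.3], "volatility": [0.15, 0.22, 0.18]},
    index=["AAPL", "MSFT", "NVDA"],   # hypothetical tickers
)
edges = pd.DataFrame({"source": ["AAPL", "MSFT"], "target": ["MSFT", "NVDA"]})

graph = StellarGraph(nodes=nodes, edges=edges)
generator = FullBatchNodeGenerator(graph, method="gcn")

# Two-layer GCN whose output tensors feed a downstream Keras model,
# e.g. a ranking head over equity embeddings.
gcn = GCN(layer_sizes=[16, 16], activations=["relu", "relu"], generator=generator)
x_inp, x_out = gcn.in_out_tensors()
print(graph.info())
```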

Comparison and Analysis

While every model offered some level of differentiated insight, Llama 3 and Cohere Command R+ stood out for their comprehensive, well-structured outputs. Notably, Gemma 1.1’s output was more conservative, while Mistral 7B failed to respond through HuggingChat at all (possibly a weekend capacity issue).

Conclusion

This experiment demonstrates the potential of Large Language Models in tackling complex problems like portfolio optimization. While there is still much to learn, the outputs from Llama 3 and Cohere Command R+ provide a solid foundation for further development. I plan to test-drive this code and publish my findings in the coming weeks.

Takeaways

  • LLMs can be used to generate baseline approaches to complex problems like portfolio optimization
  • Different models have unique strengths and weaknesses, and selecting the right model for the task is crucial
  • Further development and refinement are necessary to create a production-ready system

I hope this experiment inspires others to explore the capabilities of Large Language Models and their potential applications in finance and beyond.

Original article published by Senthil Ravindran on LinkedIn.
