Publications
2026
-
Predicting a target agent’s next decision from K prior games. (A) LLM-as-Predictor directly prompts a large LLM for the decision. (B) A text-tabular formulation feeds game features and text representations to a tabular foundation model (TabPFN). (C) Our model adds LLM-as-Observer: the hidden state of a small frozen LLM that reads the game state and dialogue becomes an additional decision-oriented representation. Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular ModelingEilam Shapira, Moshe Tennenholtz, and Roi ReichartarXiv preprint arXiv:2605.12411, 2026AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart’s LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart’s next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while K previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at K=16, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.
@article{shapira2026predictingdecisionsaiagents, title = {Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling}, author = {Shapira, Eilam and Tennenholtz, Moshe and Reichart, Roi}, year = {2026}, eprint = {2605.12411}, archiveprefix = {arXiv}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2605.12411}, journal = {arXiv preprint arXiv:2605.12411}, } -
The MulTaBench Curation Pipeline. Datasets are included if joint prediction outperforms unimodal baselines and if Target-Aware Representations improve on frozen, off-the-shelf embeddings. MulTaBench: Benchmarking Multimodal Tabular Learning with Text and ImageAlan Arazi*, Eilam Shapira*, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, +7 more authors, and Roi ReichartarXiv preprint arXiv:2605.10616, 2026Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.
@article{arazi2026multabenchbenchmarkingmultimodal, title = {MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image}, author = {Arazi, Alan and Shapira, Eilam and Grunblat, Shoham and Ventura, Mor and Hoffer, Elad and Blayer, Gioia and Holzmüller, David and Purucker, Lennart and Varoquaux, Gaël and Hutter, Frank and Reichart, Roi}, year = {2026}, eprint = {2605.10616}, archiveprefix = {arXiv}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2605.10616}, journal = {arXiv preprint arXiv:2605.10616}, } - STRABLE: Benchmarking Tabular Machine Learning with StringsGioia Blayer, Myung Jun Kim, Félix Lefebvre, Lennart Purucker, Alan Arazi, +3 more authors, Eilam Shapira, Roi Reichart, Frank Hutter, Marine Le Morvan, David Holzmüller, +4 more authors, and Gaël VaroquauxarXiv preprint arXiv:2605.12292, 2026
Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.
@article{blayer2026strablebenchmarkingtabular, title = {STRABLE: Benchmarking Tabular Machine Learning with Strings}, author = {Blayer, Gioia and Kim, Myung Jun and Lefebvre, Félix and Purucker, Lennart and Arazi, Alan and Shapira, Eilam and Reichart, Roi and Hutter, Frank and Morvan, Marine Le and Holzmüller, David and Varoquaux, Gaël}, year = {2026}, eprint = {2605.12292}, archiveprefix = {arXiv}, primaryclass = {cs.LG}, url = {https://arxiv.org/abs/2605.12292}, journal = {arXiv preprint arXiv:2605.12292} } -
Pearson correlations of base models and human decisions (x-axis) vs. aligned models and human decisions (y-axis) across four game families. Points below the diagonal indicate base advantage. Alignment Makes Language Models Normative, Not DescriptiveEilam Shapira, Moshe Tennenholtz, and Roi ReichartarXiv preprint arXiv:2603.17218, 2026Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
@article{shapira2026alignmentmakeslanguagemodels, title = {Alignment Makes Language Models Normative, Not Descriptive}, author = {Shapira, Eilam and Tennenholtz, Moshe and Reichart, Roi}, year = {2026}, eprint = {2603.17218}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2603.17218}, journal = {arXiv preprint arXiv:2603.17218}, } -
(a) TabSchema compiles trajectory-derived schema, state, and dependency signals into tabular features. (b) TabHead consumes features and candidates to output calibrated probabilities. TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual ClassifiersIdo Levy*, Eilam Shapira*, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, +2 more authors, and Segev ShlomovarXiv preprint arXiv:2602.16429, 2026Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.
@article{levy2026tabagentframeworkreplacingagentic, title = {TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers}, author = {Levy, Ido and Shapira, Eilam and Goldshtein, Yinon and Yaeli, Avi and Mashkif, Nir and Shlomov, Segev}, year = {2026}, eprint = {2602.16429}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2602.16429}, journal = {arXiv preprint arXiv:2602.16429}, } - Textual Planning with Explicit Latent TransitionsEliezer Shlomi, Ido Levy, Eilam Shapira, Michael Katz, Guy Uziel, Segev Shlomov, Nir Mashkif, Roi Reichart, +5 more authors, and Sarah Keren2026
Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates generalization to alternative plans rather than memorizing seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain’s transitions, while transfer across domain boundaries remains a bottleneck.
@misc{shlomi2026textualplanningexplicitlatent, title = {Textual Planning with Explicit Latent Transitions}, author = {Shlomi, Eliezer and Levy, Ido and Shapira, Eilam and Katz, Michael and Uziel, Guy and Shlomov, Segev and Mashkif, Nir and Reichart, Roi and Keren, Sarah}, year = {2026}, eprint = {2602.04557}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2602.04557}, journal = {arXiv preprint arXiv:2602.04557} } -
Illustration of the "poisoned apple" example, in which Alice increases her payoff at Bob’s expense by releasing a new technology—without the players actually using that technology in practice. The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI AgentsEilam Shapira, Roi Reichart, and Moshe TennenholtzarXiv preprint arXiv:2601.11496, 2026The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the ”Poisoned Apple” effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator’s choice of market design in their favor. This strategic release improves the releaser’s welfare at the expense of their opponent and the regulator’s fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.
@article{shapira2026poisonedappleeffectstrategic, title = {The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents}, author = {Shapira, Eilam and Reichart, Roi and Tennenholtz, Moshe}, year = {2026}, eprint = {2601.11496}, archiveprefix = {arXiv}, primaryclass = {cs.GT}, journal = {arXiv preprint arXiv:2601.11496}, url = {https://arxiv.org/abs/2601.11496}, } - From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise ProductionSegev Shlomov, Alon Oved, Sami Marreed, Ido Levy, Offer Akrabi, Avi Yaeli, Łukasz Strąk, Elizabeth Koumpan, Yinon Goldshtein, +7 more authors, Eilam Shapira, Nir Mashkif, and Asaf AdiProceedings of the AAAI Conference on Artificial Intelligence, Mar 2026
Agents are rapidly advancing in automating digital work, but enterprises face a harder challenge: moving beyond prototypes to deployed systems that deliver measurable business value. This path is complicated by fragmented frameworks, slow development, and the absence of standardized evaluation practices. Generalist agents have emerged as a promising direction, excelling on academic benchmarks and offering flexibility across tasks, applications, and modalities. Yet, evidence of their use in enterprise settings remains limited. This paper reports IBM’s experience developing and piloting the Computer Using Generalist Agent (CUGA). CUGA adopts a hierarchical planner–executor architecture with strong analytical foundations, achieving state-of-the-art performance on AppWorld and WebArena. Beyond benchmarks, it was evaluated in a Business-Process-Outsourcing talent acquisition pilot, addressing enterprise requirements for scalability, auditability, safety, and governance. In preliminary evaluations, CUGA approached the accuracy of specialized agents while suggesting reductions in development time and cost. We provide early evidence that generalist agents can operate at enterprise scale, distill key technical and organizational lessons, and outline requirements for transitioning research-grade architectures like CUGA into enterprise-ready systems.
@article{shlomov2026benchmarks, title = {From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production}, author = {Shlomov, Segev and Oved, Alon and Marreed, Sami and Levy, Ido and Akrabi, Offer and Yaeli, Avi and Str{\k{a}}k, {\L}ukasz and Koumpan, Elizabeth and Goldshtein, Yinon and Shapira, Eilam and Mashkif, Nir and Adi, Asaf}, journal = {Proceedings of the AAAI Conference on Artificial Intelligence}, volume = {40}, number = {47}, pages = {40423--40431}, year = {2026}, month = mar, doi = {10.1609/aaai.v40i47.41485}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/41485}, }
2025
- Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-TuningKajetan Dymkiewicz, Ivan Vulic, Helen Yannakoudakis, Eilam Shapira, Roi Reichart, and Anna KorhonenarXiv preprint arXiv:2511.13368, 2025
Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages remains poorly understood. We conduct a controlled LoRA fine-tuning study across multiple open-weight LLM families and scales, using a standardised grid of 11 languages and four benchmarks. We fine-tune each model on a single task-language source and measure transfer when evaluated on all other task-language target pairs. We decompose transfer into three regimes: (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language). Single-source fine-tuning yields a net positive uplift across regimes, but the gains are strongly asymmetric. Matched-Task (Cross-Language) transfer emerges as the most effective and predictable regime, driven principally by the identity of the target language rather than model architecture. We identify a stable hierarchy where high-resource languages and broad semantic tasks act as efficient recipients that absorb gains from diverse sources, while specialised tasks and lower-resource languages are more isolated. These results imply that effective fine-tuning requires navigating donor-recipient roles to maximise downstream gains.
@article{dymkiewicz2025donors, title = {Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning}, author = {Dymkiewicz, Kajetan and Vulic, Ivan and Yannakoudakis, Helen and Shapira, Eilam and Reichart, Roi and Korhonen, Anna}, year = {2025}, eprint = {2511.13368}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2511.13368}, journal = {arXiv preprint arXiv:2511.13368} } - NeurIPS
The TabSTAR architecture illustrated with our toy dataset. The model processes numerical features, textual features, and all possible target values for classification. TabSTAR: A Tabular Foundation Model for Tabular Data with Text FieldsAlan Arazi, Eilam Shapira, and Roi ReichartAdvances in Neural Information Processing Systems, 2025While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees. However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Tabular Foundation Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
@inproceedings{arazi2025tabstar, title = {Tab{STAR}: A Tabular Foundation Model for Tabular Data with Text Fields}, author = {Arazi, Alan and Shapira, Eilam and Reichart, Roi}, booktitle = {Advances in Neural Information Processing Systems}, editor = {Belgrave, D. and Zhang, C. and Lin, H. and Pascanu, R. and Koniusz, P. and Ghassemi, M. and Chen, N.}, pages = {172108--172161}, publisher = {Curran Associates, Inc.}, volume = {38}, year = {2025}, url = {https://proceedings.neurips.cc/paper_files/paper/2025/file/faf6e23e198314c7728eaa6ac44ae079-Paper-Conference.pdf}, } - Fairness under CompetitionRonen Gradwohl, Eilam Shapira, and Moshe TennenholtzAdvances in Neural Information Processing Systems, 2025
Algorithmic fairness has emerged as a central issue in ML, and it has become standard practice to adjust ML algorithms so that they will satisfy fairness requirements such as Equal Opportunity. In this paper we consider the effects of adopting such fair classifiers on the overall level of ecosystem fairness. Specifically, we introduce the study of fairness with competing firms, and demonstrate the failure of fair classifiers in yielding fair ecosystems. Our results quantify the loss of fairness in systems, under a variety of conditions, based on classifiers’ correlation and the level of their data overlap. We show that even if competing classifiers are individually fair, the ecosystem’s outcome may be unfair; and that adjusting biased algorithms to improve their individual fairness may lead to an overall decline in ecosystem fairness. In addition to these theoretical results, we also provide supporting experimental evidence. Together, our model and results provide a novel and essential call for action.
@inproceedings{gradwohl2025fairness, title = {Fairness under Competition}, author = {Gradwohl, Ronen and Shapira, Eilam and Tennenholtz, Moshe}, booktitle = {Advances in Neural Information Processing Systems}, editor = {Belgrave, D. and Zhang, C. and Lin, H. and Pascanu, R. and Koniusz, P. and Ghassemi, M. and Chen, N.}, pages = {53978--54004}, publisher = {Curran Associates, Inc.}, volume = {38}, year = {2025}, url = {https://proceedings.neurips.cc/paper_files/paper/2025/file/4dd0a016d7d253d02473e4778414ab0b-Paper-Conference.pdf}, } - Human Choice Prediction in Language-based Persuasion Games: Simulation-based Off-Policy EvaluationTransactions of the Association for Computational Linguistics, Aug 2025
Recent advances in Large Language Models (LLMs) have spurred interest in designing LLM-based agents for tasks that involve interaction with human and artificial agents. This paper addresses a key aspect in the design of such agents: predicting human decisions in off-policy evaluation (OPE). We focus on language-based persuasion games, where an expert aims to influence the decision-maker through verbal messages. In our OPE framework, the prediction model is trained on human interaction data collected from encounters with one set of expert agents, and its performance is evaluated on interactions with a different set of experts. Using a dedicated application, we collected a dataset of 87K decisions from humans playing a repeated decision-making game with artificial agents. To enhance off-policy performance, we propose a simulation technique involving interactions across the entire agent space and simulated decision-makers. Our learning strategy yields significant OPE gains, e.g., improving prediction accuracy in the top 15% challenging cases by 7.1%.1
@article{shapira2025human, author = {Shapira, Eilam and Madmon, Omer and Apel, Reut and Tennenholtz, Moshe and Reichart, Roi}, title = {Human Choice Prediction in Language-based Persuasion Games: Simulation-based Off-Policy Evaluation}, journal = {Transactions of the Association for Computational Linguistics}, volume = {13}, pages = {980-1006}, year = {2025}, month = aug, issn = {2307-387X}, doi = {10.1162/TACL.a.16}, url = {https://doi.org/10.1162/TACL.a.16}, eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/TACL.a.16/2549171/tacl.a.16.pdf}, } - Transition Function Prediction in AI Planning Using LLMsEliezer Shlomi, Guy Azran, Eilam Shapira, Omer Nahum, Roi Reichart, Guy Uziel, Michael Katz, Ateret Anaby Tavor, +5 more authors, and Sarah KerenAAAI 2025 Workshop LM4Plan, 2025
Large Language Models (LLMs) excel in natural language processing tasks but face significant challenges in classical planning, where accurate and feasible transitions between states are required. This work presents a novel approach that embeds states and actions into a structured feature space, using a shallow neural network as a transition function. By performing planning in the latent space, the method significantly reduces the computational cost associated with frequent LLM calls while retaining logical consistency. We evaluate this framework as a classifier and demonstrate promising results in state transition prediction and planning tasks across natural language-described domains. The approach offers insights into efficient and scalable LLM-based planning, bridging the gap between natural language understanding and practical planning systems.
@inproceedings{shlomi2025transition, title = {Transition Function Prediction in {AI} Planning Using {LLM}s}, author = {Shlomi, Eliezer and Azran, Guy and Shapira, Eilam and Nahum, Omer and Reichart, Roi and Uziel, Guy and Katz, Michael and Tavor, Ateret Anaby and Keren, Sarah}, booktitle = {AAAI 2025 Workshop LM4Plan}, year = {2025}, url = {https://openreview.net/forum?id=SPRYlWKaPj}, }
2024
-
GLEE: A Unified Framework and Benchmark for Language-based Economic EnvironmentsEilam Shapira, Omer Madmon, Itamar Reinman, Samuel Joseph Amouyal, Roi Reichart, +2 more authors, and Moshe TennenholtzarXiv preprint arXiv:2410.05254, 2024Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach an efficient and fair outcome? What is the role of natural language in the strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLMbased agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents’ performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to human players in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents. We believe that our framework can contribute to the growing intersection of LLMs, ML, and economics, and we encourage researchers to explore it further and build on its foundation.
@article{shapira2024gleeunifiedframeworkbenchmark, journal = {arXiv preprint arXiv:2410.05254}, title = {GLEE: A Unified Framework and Benchmark for Language-based Economic Environments}, author = {Shapira, Eilam and Madmon, Omer and Reinman, Itamar and Amouyal, Samuel Joseph and Reichart, Roi and Tennenholtz, Moshe}, year = {2024}, eprint = {2410.05254}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2410.05254}, } - JAIR
Results for the prediction task introduced in the paper, comparing alternative ways to use data from the 110 human players and from the LLM-generated players. Can Large Language Models Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion GamesarXiv preprint arXiv:2401.17435, 2024Human choice prediction in economic contexts is crucial for applications in marketing, finance, public policy, and more. This task, however, is often constrained by the difficulties in acquiring human choice data. With most experimental economics studies focusing on simple choice settings, the AI community has explored whether LLMs can substitute for humans in these predictions and examined more complex experimental economics settings. However, a key question remains: can LLMs generate training data for human choice prediction? We explore this in language-based persuasion games, a complex economic setting involving natural language in strategic interactions. Our experiments show that models trained on LLMgenerated data can effectively predict human behavior in these games and even outperform models trained on actual human data.
@article{shapira2024can, title = {Can Large Language Models Replace Economic Choice Prediction Labs? The Case of Language-based Persuasion Games}, author = {Shapira, Eilam and Madmon, Omer and Reichart, Roi and Tennenholtz, Moshe}, journal = {arXiv preprint arXiv:2401.17435}, year = {2024}, } - Text2Model: Text-based Model Induction for Zero-shot Image ClassificationOhad Amosy, Tomer Volk, Eilam Shapira, Eyal Ben-David, Roi Reichart, +2 more authors, and Gal ChechikFindings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024
We address the challenge of building task-agnostic classifiers using only text descriptions, demonstrating a unified approach to image classification, 3D point cloud classification, and action recognition from scenes. Unlike approaches that learn a fixed representation of the output classes, we generate at inference time a model tailored to a query classification task. To generate task-based zero-shot classifiers, we train a hypernetwork that receives class descriptions and outputs a multi-class model. The hypernetwork is designed to be equivariant with respect to the set of descriptions and the classification layer, thus obeying the symmetries of the problem and improving generalization. Our approach generates non-linear classifiers, handles rich textual descriptions, and may be adapted to produce lightweight models efficient enough for on-device applications. We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions: From single words to rich descriptions. Our results demonstrate strong improvements over previous approaches, showing that zero-shot learning can be applied with little training data. Furthermore, we conduct an analysis with foundational vision and language models, demonstrating that they struggle to generalize when describing what attributes the class lacks.
@inproceedings{amosy-etal-2024-text2model, title = {{T}ext2{M}odel: Text-based Model Induction for Zero-shot Image Classification}, author = {Amosy, Ohad and Volk, Tomer and Shapira, Eilam and Ben-David, Eyal and Reichart, Roi and Chechik, Gal}, editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung}, booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024}, month = nov, year = {2024}, address = {Miami, Florida, USA}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2024.findings-emnlp.8}, pages = {155--172} }