ARIA - Artistic Rendering of Images into Audio

ARIA is a multimodal AI model that generates MIDI music from the emotional content of artwork. A CLIP-based image encoder predicts the emotional valence and arousal of an input image, and a conditional MIDI generation model then produces music matching those values.

Model Description

  • Developed by: Vincent Amato
  • Model type: Multimodal (Image-to-MIDI) Generation
  • Language(s): English
  • License: MIT
  • Parent Model: Uses CLIP for image encoding and midi-emotion for music generation
  • Repository: GitHub

Model Architecture

ARIA consists of two main components:

  1. A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
  2. A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values

The model offers three different conditioning modes:

  • continuous_concat: Emotions as a continuous vector concatenated to every token embedding
  • continuous_token: Emotions as a continuous vector prepended to the sequence as an extra token
  • discrete_token: Emotions quantized into discrete tokens
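
For illustration, the sketch below shows one way these three strategies could be realized on a batch of token embeddings. All names and shapes here (the valence/arousal range, bin count, embedding sizes) are assumptions for the example, not ARIA's actual implementation; consult the midi-emotion repository for the real conditioning code.

```python
# Illustrative sketch only: three ways to inject a (valence, arousal)
# pair into a transformer that consumes embeddings of shape
# (batch, seq_len, d_model). Shapes and names are assumptions.
import torch

batch, seq_len, d_model = 1, 16, 512
tokens = torch.randn(batch, seq_len, d_model)   # MIDI token embeddings
emotion = torch.tensor([[0.7, -0.2]])           # (valence, arousal), assumed in [-1, 1]

# continuous_concat: append the 2-D emotion vector to every position,
# widening the model input to d_model + 2 features per token.
concat_input = torch.cat(
    [tokens, emotion.unsqueeze(1).expand(batch, seq_len, 2)], dim=-1
)                                               # (1, 16, 514)

# continuous_token: project the emotion vector into embedding space and
# prepend it to the sequence as one extra "token".
project = torch.nn.Linear(2, d_model)
token_input = torch.cat([project(emotion).unsqueeze(1), tokens], dim=1)  # (1, 17, 512)

# discrete_token: quantize each emotion dimension into bins and look up
# dedicated emotion-token embeddings, prepended like ordinary tokens.
n_bins = 5
boundaries = torch.linspace(-1, 1, n_bins + 1)[1:-1]
ids = torch.bucketize(emotion, boundaries)      # bin index per dimension
ids = ids + torch.tensor([0, n_bins])           # disjoint id ranges for the two axes
emotion_tokens = torch.nn.Embedding(2 * n_bins, d_model)
discrete_input = torch.cat([emotion_tokens(ids), tokens], dim=1)         # (1, 18, 512)
```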

Usage

The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:

  • model.pt: The trained model weights
  • mappings.pt: Token mappings for MIDI generation
  • model_config.pt: Model configuration

Additionally, image_encoder.pt contains the CLIP-based image emotion encoder.
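
As a minimal starting point, the published files can be loaded with torch.load, as sketched below. The per-variant directory layout is an assumption, and rebuilding runnable model objects from these artifacts should be verified against the ARIA and midi-emotion repositories rather than inferred from this sketch.

```python
# Minimal loading sketch for one variant's files. Reconstructing the
# model classes from these artifacts is not shown here; check the
# ARIA/midi-emotion repositories for the actual definitions.
import torch

variant = "continuous_concat"   # or "continuous_token" / "discrete_token"
device = "cuda" if torch.cuda.is_available() else "cpu"

# weights_only=False because the config/mappings may be pickled Python
# objects rather than plain tensors (needed on recent PyTorch versions).
config = torch.load(f"{variant}/model_config.pt", map_location=device, weights_only=False)
mappings = torch.load(f"{variant}/mappings.pt", map_location=device, weights_only=False)
weights = torch.load(f"{variant}/model.pt", map_location=device, weights_only=False)
image_encoder = torch.load("image_encoder.pt", map_location=device, weights_only=False)

print(config)                   # hyperparameters for instantiating the generator
print(type(mappings))           # token <-> id mappings for MIDI events
```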

Intended Use

This model is designed for:

  • Generating music that matches the emotional content of artwork
  • Exploring emotional transfer between visual and musical domains
  • Creative applications in art and music generation

Limitations

  • Output quality depends on how well the image encoder captures the emotional content of the input image
  • Generated MIDI may require human curation for professional use
  • The model's emotional understanding is limited to the two-dimensional valence-arousal space

Training Data

The model combines:

  1. Image encoder: Fine-tuned on ArtBench with emotional annotations
  2. MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project

Attribution

This project builds upon:

  • midi-emotion by Serkan Sulun et al. (GitHub)
    • Paper: "Symbolic music generation conditioned on continuous-valued emotions" (IEEE Access)
    • Citation: S. Sulun, M. E. P. Davies and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022.
  • CLIP by OpenAI for the base image encoder architecture

License

This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.
