August 21, 2023

Is there any way to use Stable Diffusion with the Hugging Face Diffusers library without the text encoder? I only do image-to-image generation, without any prompt.

My research process involved reviewing various sources, including Reddit discussions, GitHub repositories, and Hugging Face documentation. The sources offered a range of opinions and information on image-to-image generation with Stable Diffusion without a text encoder. Some provided direct answers and solutions, while others discussed related topics or offered indirect suggestions. Overall, there is broad agreement that Stable Diffusion can be used for image-to-image generation without a text encoder, though the exact implementation details vary.

Composed by S. L.

Hugging Face Stable Diffusion Demo for Image-to-Image Generation

A free demo of Stable Diffusion's img2img mode is available on Hugging Face (HF) and can be used for image-to-image generation without any prompt. The demo allows users to combine other models, such as dalle-mini, with Stable Diffusion and has been praised for its high-quality results. It can be accessed through HF Spaces, and examples of generated images can be found on Reddit and HF Spaces.

StableDiffusionImg2ImgPipeline

The StableDiffusionImg2ImgPipeline from the Hugging Face Diffusers library can be used for image-to-image generation without a meaningful text prompt. The pipeline is designed for text-guided image-to-image generation with Stable Diffusion, but by leaving the `prompt` argument empty (or passing pre-computed `prompt_embeds`), images can be generated with essentially no text guidance. Various other input arguments can be used to control the generation process.
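
As a rough, non-authoritative sketch of this usage (the model ID runwayml/stable-diffusion-v1-5 and the file names input.png and output.png are assumptions, not details taken from the sources), passing an empty prompt together with a guidance_scale of 1 or below skips classifier-free guidance, so the text conditioning has essentially no effect on the result:

```python
# Hedged sketch: image-to-image generation with no meaningful text guidance.
# Model ID and file paths are placeholders; adapt them to your setup.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="",               # empty prompt: nothing meaningful to condition on
    image=init_image,
    strength=0.6,            # how far to move away from the initial image
    guidance_scale=1.0,      # <= 1 disables classifier-free guidance
    num_inference_steps=50,
)
result.images[0].save("output.png")
```

Raising `strength` lets the model depart further from the initial image, while lower values keep the output closer to it.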

Colab Notebooks and GitHub Repositories

Several Colab notebooks and GitHub repositories demonstrate the use of Stable Diffusion for image-to-image generation. One example is a Colab notebook that shows how to teach Stable Diffusion new concepts via textual inversion. Although that notebook focuses on textual prompts, it may offer insights into adapting the process for image-to-image generation without text encoders. Additionally, a Kaggle notebook by josepc uses Hugging Face diffusers for image-to-image generation.

Local Installation and Usage

For users interested in running Stable Diffusion locally on their computers, there are guides available for barebones installation with command-line executions or with a graphical user interface (GUI). These guides can help users set up Stable Diffusion on their machines and potentially adapt the configurations for image-to-image generation without text encoders.

Conclusion

Based on the available sources, it is possible to use Stable Diffusion with Hugging Face Diffusers for image-to-image generation without relying on the text encoder. Users can explore the Hugging Face Stable Diffusion demo, the StableDiffusionImg2ImgPipeline, and various Colab notebooks and GitHub repositories to find suitable solutions for their specific needs. They can also install Stable Diffusion locally for further customization and control over the generation process.

Research

"https://github.com/huggingface/diffusers"

Notes:

  • Hugging Face Diffusers is a library of state-of-the-art diffusion models for generating images, audio, and even 3D structures of molecules.
  • Its core components are diffusion pipelines that can be run in inference with just a few lines of code, interchangeable noise schedulers that trade off diffusion speed against output quality, and pretrained models that can be combined with schedulers as building blocks for end-to-end diffusion systems.
  • The library aims to be user-friendly, customizable, and modular, prioritizing ease of use over performance.
  • To install Hugging Face Diffusers, it is recommended to use a virtual environment and install from PyPI or Conda.
  • The website offers installation commands using pip and conda for both PyTorch and Flax.
  • The website also offers a guide for Apple Silicon (M1/M2) support.
  • A Quickstart tutorial is available for getting started, with code snippets for generating outputs from pretrained models (a hedged sketch of such a snippet appears after this list).
  • Different guides are offered on how to load and configure all the components of the library, as well as how to use different schedulers.
  • Guides for using pipelines are available for different inference tasks, including batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library.
  • Optimization guides are available to run diffusion models faster and consume less memory.
  • Training guides are available for different tasks and with different training techniques.
  • The library offers contributions from the open-source community, including a Good first issues section for general opportunities to contribute.
  • Popular tasks and pipelines are listed on the website, such as Unconditional Image Generation, Text-to-Image, Text-guided Image-to-Image, Text-guided Image Inpainting, and Super Resolution.
  • Popular libraries using Hugging Face Diffusers are also listed on the website, including Microsoft TaskMatrix, InvokeAI, and Apple ml-stable-diffusion.
  • The library concretizes previous work by different authors, and the website acknowledges their implementation’s help in library development.
  • A citation is available for the library.
  • The website does not directly answer the query about using Stable Diffusion with Hugging Face Diffusers without a text encoder.
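
For illustration, a quickstart along the lines these notes describe might look like the sketch below; the model ID and output file name are assumptions rather than values taken from the page:

```python
# Hedged quickstart sketch: load a pretrained pipeline and generate an image
# in a few lines of code.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```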

"List of Stable Diffusion systems"

  • This is a Reddit post of a list of stable diffusion systems and resources.
  • The post was made 1 year and 4 days ago in the r/StableDiffusion subreddit and has received 1050 points.
  • The author’s intention is to list only the more popular and/or better systems and other resources, although there are some items on the list that might not meet that standard.
  • There is a processing backlog of hundreds, maybe thousands, of Reddit posts as of November 18th, 2022.
  • The author includes links for installation and usage guides for various systems on the list.
  • The list includes some recommended systems, denoted by *PICK*.
  • The author has not tried some of the listed items – including *PICK* items – so it is advised to use them at the user’s own discretion.
  • The list is divided into 4 parts: Web apps, Google Colab notebooks, resources, and miscellaneous systems.
  • Under miscellaneous systems, there is a GitHub repo by CompVis that is recommended as a *PICK* item.
  • There is a Windows program called Stable Diffusion GRisk that is listed under miscellaneous systems.
  • A Kaggle notebook by josepc is listed, which uses HuggingFace diffusers.
  • Pixelmind has listed multiple systems, including a web app and Discord bot.
  • Pixelz AI has several apps that use Stable Diffusion.
  • bes-dev has a GitHub repo that is compatible with OpenVINO.
  • NMKD Stable Diffusion GUI is a Windows program that is listed as a *PICK* item.
  • AbdBarho’s GitHub repo can be used to run Stable Diffusion on your machine with a nice UI without any hassle.
  • Lightricks has Apple apps and a Google Play app that uses Stable Diffusion.
  • Visions of Chaos, a Windows program, has many scripts, including text-to-image systems that use Stable Diffusion.
  • cmdr2’s GitHub repo provides the easiest 1-click way to install and use Stable Diffusion on your own computer and provides a browser UI for generating images from text prompts and images.
  • There are multiple apps for Wonder that use Stable Diffusion.
  • AUTOMATIC1111’s Stable Diffusion web UI provides a browser interface based on Gradio library for Stable Diffusion and is a *PICK* item.
  • Engineer-of-Stuff’s GitHub repo of Jupyter notebooks can be used for Paperspace.
  • nateraw

"Teach new concepts to Stable Diffusion with 3-5 images only - and browse a library of learned concepts to use"

Notes:

  • The webpage describes Stable Diffusion, which is a method for generating high-quality images using diffusion models.
  • Stable Diffusion is a probabilistic model that uses a diffusion process to generate images. Training gradually adds noise to images; generation reverses the process, starting from noise and iteratively denoising until a coherent image emerges.
  • Textual inversion is a technique that teaches Stable Diffusion a new concept from a handful of example images by learning a new token embedding, which can then be used in text prompts.
  • A link to a Google Colab notebook is provided, which shows how to use textual inversion to train Stable Diffusion on a new concept.
  • Another link is provided to a library of learned concepts that can be used with Stable Diffusion (a hedged sketch of loading such a concept appears after this list).
  • Several Reddit users ask questions and provide feedback on the links.
  • Some Reddit users report that they are unable to run the Colab Notebook due to GPU memory issues or other errors.
  • Other Reddit users provide tips for modifying the configuration of Stable Diffusion to improve performance on lower-end GPUs.
  • Links to other resources related to Stable Diffusion and image generation are provided.
  • A user asks if there is a way to use Stable Diffusion with Hugging Face diffuser without a text encoder.
  • No direct answer is provided to this question; the webpage instead covers textual inversion, which generates images from text prompts and therefore still relies on the text encoder.
  • Overall, the webpage provides valuable information about using Stable Diffusion, as well as related methods and tools, to generate high-quality images.
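
As a hedged sketch of how a learned concept from such a library might be loaded and used (the concept repository sd-concepts-library/cat-toy and its <cat-toy> placeholder token are illustrative examples, not items named on the page):

```python
# Hedged sketch: load a textual-inversion concept and use its placeholder
# token inside a prompt. Repository and token names are examples only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a photo of a <cat-toy> on a beach").images[0]
image.save("cat_toy.png")
```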

"Guide for DreamBooth with 8GB vram under Windows"

  • A tutorial, posted on August 2, 2021, has instructions on how to use DreamBooth with 8GB Vram under Windows.
  • The tutorial uses hugging face diffusers and deepspeed to train models.
  • There may be some unnecessary steps, but following the instructions should enable users to run the model.
  • Some knowledge of Linux is helpful.
  • The tutorial uses the repo/branch posted earlier, and modifies another guide.
  • These are the suggested steps:
    • Install the GPU driver you need.
    • Install WSL2 and Ubuntu on it.
    • Update Ubuntu and install the necessary packages.
    • Clone the deepspeed repo, and then run a script.
    • Clone the DreamBooth repo, install its dependencies, and then run the script.
    • Finally, run the last script.
  • There is a comment thread discussing some aspects of the guide:
    • One user asked about quality comparisons between the 8GB and 24GB versions, to which another user replied that it should be similar to other versions that use Hugging Face diffusers, except for the 8-bit Adam flag.
    • Another user asked if the guide can be used without using a text encoder, and if it is possible to have image-to-image generation without any prompt.
    • There is a discussion about regularization images and their importance in preventing overtraining.
    • One user reported that the guide worked after adapting it for a bare-metal installation of Ubuntu 20.04.
    • Some users reported details about their own experiences and setups.
  • The instructions were updated on August 17, 2021, to use ShivamShrirao’s repo instead.
  • The official DreamBooth GitHub page has more information about the different models.

"Diffusers - Hugging Face"

Not used in article

"https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/img2img"

  • The Stable Diffusion model can be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images.
  • The StableDiffusionImg2ImgPipeline uses the diffusion-denoising mechanism proposed in SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations.
  • SDEdit synthesizes realistic images by adding noise to the input and then iteratively denoising it through a stochastic differential equation (SDE) prior, which increases realism while preserving the input's overall structure.
  • The method doesn't require task-specific training or inversion and strikes a balance between realism and faithfulness to the input image.
  • StableDiffusionImg2ImgPipeline is a pipeline for text-guided image-to-image generation using Stable Diffusion.
  • The pipeline takes an initial image as an input argument and can generate images without meaningful text guidance, for example by passing an empty prompt or pre-computed prompt embeddings instead of a prompt string.
  • The pipeline inherits from DiffusionPipeline and has generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
  • The pipeline has loading methods such as from_single_file() for loading .ckpt files, load_textual_inversion() for loading textual inversion embeddings, load_lora_weights() for loading LoRA weights, and save_lora_weights() for saving LoRA weights.
  • The __call__ function of the pipeline is used for generation and can take various arguments such as image, strength, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, and cross_attention_kwargs.
  • strength indicates the extent to transform the reference image, num_inference_steps refers to the number of denoising steps, guidance_scale encourages the model to generate images closely linked to the text prompt at the expense of lower image quality.
  • negative_prompt is the prompt to guide what not to include in image generation, num_images_per_prompt is the number of images to generate per prompt, eta is the parameter from the DDIM paper.
  • generator is a torch.Generator to make generation deterministic, prompt_embeds are pre-generated text embeddings that can be used to tweak text inputs, negative_prompt_embeds are pre-generated negative text embeddings.
  • If return_dict is True, a StableDiffusionPipelineOutput is returned (containing the generated images and a flag indicating possible NSFW content); otherwise, a plain tuple is returned. A hedged usage sketch covering prompt_embeds and several of the other arguments above follows this list.
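
A hedged sketch of the prompt_embeds route mentioned above: the embedding of an empty prompt is computed once and reused, so individual calls never pass a text string. The model ID and file names are assumptions, and the exact encoding step may differ between diffusers versions:

```python
# Hedged sketch: pre-compute an "empty prompt" embedding and pass it via
# prompt_embeds, with a fixed generator for deterministic output.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Encode the empty string once with the pipeline's tokenizer and text encoder.
tokens = pipe.tokenizer(
    "",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
)
with torch.no_grad():
    empty_embeds = pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

init_image = Image.open("input.png").convert("RGB").resize((512, 512))
generator = torch.Generator("cuda").manual_seed(0)  # deterministic generation

out = pipe(
    prompt_embeds=empty_embeds,   # no prompt string is passed at all
    image=init_image,
    strength=0.6,
    num_inference_steps=50,
    guidance_scale=1.0,           # <= 1, so negative embeddings aren't needed
    generator=generator,
)
out.images[0].save("output.png")
```

Since the text encoder is only consulted once here, a setup like this could in principle cache the embedding and skip loading the text encoder on later runs, though that detail goes beyond what the sources describe.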

"[N] Diffusers: Introducing Hugging Face's new library for diffusion models."

  • Hugging Face has launched an open-source toolbox for diffusion techniques called Diffusers.
  • Diffusers aims to centralize the most important open-sourced research on diffusion models and make them more accessible and easier to use for the community. They also provide the community with simple yet powerful training utilities to build powerful systems transparently.
  • Diffusers is a modular toolbox for diffusion techniques focusing on: inference pipelines, schedulers, models, and training examples.
  • Diffusion models have recently gained a lot of interest from the ML community since they play an essential role for models like DALL-E or Imagen generating high-quality, photorealistic images when prompted on text.
  • Diffusion models have achieved remarkable results in other domains, such as video generation, audio synthesis, and reinforcement learning.
  • Most recent research on diffusion models (namely, DALL-E 2 and Imagen) has not been made accessible to the machine learning community and often stays behind the closed doors of large tech companies.
  • Diffusers aims to solve this problem by providing the community with simple yet powerful training utilities.
  • The website provides links to the Github page for Diffusers and a walkthrough collab notebook.
  • According to one of the comments’ answers, diffusion models may not be adapted to NLP tasks because most NLP tasks take discrete inputs/outputs.
  • One comment provides a source where you can define a forward/backward process on a categorical distribution instead of a Gaussian, which works for discrete variables.
  • Another comment highlights a paper applying diffusion models to language tasks.
  • One of the comments discusses how diffusion models can be used in reinforcement learning agents to generate plans for higher reward. It refers to the Berkeley-based Diffuser algorithm, which uses a diffusion model to generate desirable trajectories for the agent.
  • The benefits of using diffusion models in this context are that the models generate plans that receive high rewards, which is useful because search is very hard in reinforcement learning agents, particularly with high-dimensional action spaces.
  • Another comment mentions that diffusion models could generate additional training data in image-based RL learning or generate a desired trajectory for video generation.
  • The same comment also argues that diffusion models have an advantage over decision transformers: because they are randomly seeded, they can generate interesting new behaviors, whereas imitation learning-based methods mostly reproduce demonstrated behavior.

"[D] Easily Run Stable Diffusion Image to Image mode"

  • A demo for img2img mode of Stable Diffusion is available for free on Hugging Face (HF) which can be used for image-to-image generation.
  • Result examples of img2img mode are available which display high-quality generation.
  • The HF demo can be used to combine other models like dalle-mini with Stable Diffusion.
  • Code examples on GitHub demonstrate using Stable Diffusion for text-guided image-to-image generation.
  • A Colab notebook is available to show image-to-image generation using diffusers.
  • The demo can be accessed using HF spaces.
  • A Reddit user shared an image of “Trollface bust by Michelangelo” which was generated using Stable Diffusion img2img mode.
  • Another user recommended viewing additional generation examples on HF spaces.
  • Some users found the results generated using Stable Diffusion to be really good and amazing.
  • A user asked if the program can be downloaded for local use on their computer.
  • Another user responded with links to guides for a barebones installation with command-line executions or with GUI that can be used to run Stable Diffusion locally.
