machine learning

Turbocharge your tokenization by exploiting parallelism

Parallelize Hugging Face Tokenizers with num_proc

Processing large datasets can be time-consuming, especially when it comes to tokenizing text.

But what if you could reduce your tokenization time from hours to mere minutes, without any extra effort? 🤯

In this blog post, we'll show you how to parallelize your tokenization using the num_proc parameter of the map method in the Hugging Face Datasets library.
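To give a feel for the idea before the full post, here is a minimal sketch of process-based parallel tokenization using only the Python standard library. It is not the Datasets implementation: `tokenize_batch`, `shard`, and `parallel_tokenize` are hypothetical names, and the whitespace tokenizer is a toy stand-in for a real one. With the Datasets library itself, the equivalent would be roughly `dataset.map(tokenize_fn, batched=True, num_proc=4)`, where `tokenize_fn` is your own function.

```python
from multiprocessing import Pool


def tokenize_batch(texts):
    """Toy stand-in for a real tokenizer: lowercase and split on whitespace."""
    return [text.lower().split() for text in texts]


def shard(items, num_shards):
    """Split items into num_shards contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional item each.
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards


def parallel_tokenize(texts, num_proc=1):
    """Tokenize texts with num_proc worker processes, preserving input order."""
    if num_proc <= 1:
        # Single-process path: no pool is created.
        return tokenize_batch(texts)
    with Pool(processes=num_proc) as pool:
        # Each worker tokenizes one shard; pool.map keeps the shard order.
        results = pool.map(tokenize_batch, shard(texts, num_proc))
    # Flatten the per-shard results back into one list.
    return [tokens for part in results for tokens in part]
```

When calling `parallel_tokenize(texts, num_proc=4)` from a script, put the call under an `if __name__ == "__main__":` guard so that platforms which spawn fresh worker processes do not re-run it on import.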

Wordcab Transcribe - An open-source ASR solution using Whisper, Docker and FastAPI

Automatic Speech Recognition (ASR) has become an essential tool for developers and businesses. With Wordcab Transcribe, you can leverage ASR in your projects without relying on expensive third-party platforms.

We've implemented an open-source ASR solution using Docker, FastAPI, and the faster-whisper library, a fast reimplementation of OpenAI's Whisper transcription model.

This project uses CTranslate2 under the hood to speed up audio processing, and it requires less than 5 GB of GPU VRAM with the large-v2 Whisper model.

In this blog post, we'll present the Wordcab Transcribe project and show you how to use it in your own applications.

Keep your workstation clean - Docker

Optimize Docker Storage for Machine Learning

When working with machine learning, especially with large Docker images such as the NVIDIA images used to train models on GPUs, it is important to manage your workstation's storage efficiently.

Docker is a great tool for containerization, providing a consistent environment for deploying applications.

However, as you create and run containers, unused files and storage may accumulate on your system.

In this post, we'll cover how to use Docker commands to prevent unused files and storage from cluttering your workstation.
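As a rough preview, the sketch below collects a few standard Docker CLI cleanup commands behind a small Python helper. The helper names `cleanup_commands` and `run_cleanup` are hypothetical, and the selection of commands is an assumption about what such a cleanup might include, not necessarily the set the post covers. It defaults to a dry run that only prints the commands.

```python
import subprocess


def cleanup_commands(aggressive=False):
    """Return Docker CLI commands (as argv lists) that report and reclaim disk space."""
    commands = [
        ["docker", "system", "df"],                   # report Docker disk usage
        ["docker", "container", "prune", "--force"],  # remove stopped containers
        ["docker", "image", "prune", "--force"],      # remove dangling images
        ["docker", "builder", "prune", "--force"],    # remove unused build cache
    ]
    if aggressive:
        # Also removes every image not used by a container and unused anonymous volumes.
        commands.append(["docker", "system", "prune", "--all", "--volumes", "--force"])
    return commands


def run_cleanup(aggressive=False, dry_run=True):
    """Print the cleanup commands, or execute them when dry_run is False."""
    for argv in cleanup_commands(aggressive):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(argv, check=True)
```

Calling `run_cleanup()` prints the commands without touching anything; `run_cleanup(dry_run=False)` executes them. The `aggressive=True` variant deletes images you may want to keep, so review the dry-run output first.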