Introduction
The aim of this project is to make a physical, WiFi-connected Internet radio client (regardless of the slightly confusing name). Unlike many other projects, this is not my lifelong dream or anything; I'm just very intrigued by connecting an ESP32 to an audio decoder chip (VS1053b in this case) and by the possibilities these bring to the mix. I originally thought of this back when I was using an ESP8266, before the ESP32 was released, but I did not have enough knowledge back then and failed to find such a project on Google, so I cannot really say whether someone had already done it. Later on I had the idea again, this time with an ESP32, and after a quick search I found this mess. After skipping through the video without having the patience to watch it completely, I just decided to make one myself, although without the wire spaghetti and the Arduino crap.
Unlike other projects I've described here, I have a need to make this work as soon as possible. So this project could develop very fast (and have obvious issues) or it might never be finished. Who knows? However there's much to discover and learn here. I will start prototyping this project with eBay modules, so in the beginning it will be a mess of wires (like in the video) and will mostly be about the code, which I am making mostly from scratch, since I simply don't want to read 5k+ lines of code. There are also many new things here for me, both in software and in hardware, so this might get interesting.
I had some difficulties finding sources for this project (that don't involve Arduino), but here's one application note from Microchip, here's one post on Stack Overflow and one assignment(?). Also here is an Arduino library for the VS1053b. Other sources are mentioned later on in this post. These were however a bit difficult to use properly in the text.
Background
Technically this project is very simple. An ESP32 is used to open a connection to a specific server, which then sends audio data back to the ESP. The ESP should buffer this data to even out network latencies and stream it to the VS1053b decoder chip. The decoder chip then directly outputs an analog signal, which only needs to be filtered and possibly decoupled. The decoder chip is also able to output audio in a digital format, which would make it possible to connect it to a proper DAC if necessary.
SHOUTcast
The basic idea of SHOUTcast (or Icecast) is simply to stream audio data. There's not much to it. First the server transmits a header, which includes the station name, genre, bit-rate and such. After that the server starts outputting audio data frame by frame. I'm not sure about the other formats, but at least MP3 contains a header in each frame, so the decoder chip can get all the needed information from the first few bytes of each frame. The server is also capable of sending metadata between chunks of audio data if the client requests it. The resulting data stream would look as shown below, assuming the metadata is requested. If the metadata is not requested, the stream only contains the header and continuous audio data.
[header] + [audio data] + [meta size] + [meta] + [audio data] + ...
1. The end of the header can be easily detected from the standard "\r\n\r\n" ending.
2. The size of the audio data chunk is specified in the header with the "icy-metaint" tag.
3. The "meta size" byte x 16 tells us how many bytes there are in the metadata. Metadata is padded with zeros if its length is not divisible by 16.
VS1053b
This is a very powerful chip that is able to decode MP3, AAC and OGG formats among others. The chip is also able to directly output an analog signal, as already mentioned, and can drive headphones directly. The downsides of this chip are its price, the pitch of the pins and the amount of filter caps and routing it requires. The good part, however, is that it doesn't really need to be configured in any way. Basically it's enough to set the internal frequency and the output volume and then just stream any supported audio data through the SPI bus. This makes the project very, very simple. There's much more to this chip, but that's for another time (though definitely for this same project).
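To give an idea of how little configuration that is, here is a minimal sketch of building the control (SCI) write that sets the volume. The opcode and register addresses are the ones I read from the VS1053b datasheet; the function names are made up for this example, and the actual SPI transfer is left out because it depends on the driver in use.

```c
#include <stdint.h>

/* SCI write opcode and register addresses, as listed in the VS1053b datasheet. */
#define VS_SCI_WRITE  0x02
#define VS_SCI_MODE   0x00
#define VS_SCI_CLOCKF 0x03
#define VS_SCI_VOL    0x0B

/* Fill a 4-byte SCI write frame: opcode, register address, value MSB first.
 * (Hypothetical helper; the caller clocks these bytes out with the control
 * chip select asserted.) */
void vs_sci_write_frame(uint8_t addr, uint16_t value, uint8_t frame[4])
{
    frame[0] = VS_SCI_WRITE;
    frame[1] = addr;
    frame[2] = (uint8_t)(value >> 8);
    frame[3] = (uint8_t)(value & 0xFF);
}

/* SCI_VOL holds attenuation in 0.5 dB steps: left channel in the high byte,
 * right channel in the low byte. 0x00 is the loudest setting. */
uint16_t vs_volume_word(uint8_t left_atten, uint8_t right_atten)
{
    return (uint16_t)((left_atten << 8) | right_atten);
}
```

The same frame helper works for the clock register (VS_SCI_CLOCKF); the right multiplier value depends on the crystal on the module, so I'm not suggesting one here.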
Electronics
I will not provide any schematics here, since I simply improvised some connection between the ESP32 and the VS1053b. There is one SPI bus with two CS (Chip Select) pins, one for data and one for control. Additionally there is a reset signal and DREQ (Data REQuest). All this is exceptionally well documented in the datasheet. I used an ESP32-T development board, some VS1053b module from eBay (the oldest design I suppose, the one that doesn't look like an Arduino shield) and obviously DuPont cables.
A quick note here about the decoder module from eBay. It seems that the proper filter caps for the chip power are missing. Possibly for this reason the existing caps are making high pitched noise, which is quite loud and very very annoying. There is no noise in the audio output however. Additionally according to the datasheet the GPIOs of the chip should be connected to ground if not used, and they don't seem to be connected to ground in this board. This might be the reason that the chip does not automatically decode incoming MP3 audio and requires a few specific writes to do so.
I've also decided to build a prototype of the whole device using similar modules. However there is no such module for the display I want to use, so I had to design one myself. While at it, I also made a module that would fit a rotary encoder of my choosing and two buttons with pull-ups and filters. But that is for the next part.
Code
I've programmed this device using a lot of trial and error. The result can be split into five easy steps. This could be explained with many more steps, but I would like to strip down some things, like the ever changing API of ESP32. I only want to explain the general idea here.
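Step 1, HTTP(S) GET
/* WEB_PATH, WEB_SERVER and WEB_PORT are string macros defined elsewhere.
 * The request line (GET, HTTP/1.0) is assumed here, following the
 * ESP-IDF http_request example. */
static const char *REQUEST = "GET " WEB_PATH " HTTP/1.0\r\n"
    "Host: "WEB_SERVER":"WEB_PORT"\r\n"
    "User-Agent: esp-idf/1.0 esp32\r\n"
    "Accept: */*\r\n"
    "Icy-MetaData:1\r\n"
    "Connection: close\r\n"
    "\r\n";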
Step 2, Parse header
The HTTP header can be used in this case to read the name of the stream (icy-name tag), genre (icy-genre tag), bit-rate (icy-br tag), audio format (Content-Type tag) and such. The bit-rate, sample-rate and audio format can also be read from the audio stream or from the VS1053b chip after it starts decoding, so it's not necessary to read these from the header. However it is necessary to read the icy-metaint value if the metadata has been requested. Other tags do not really affect functionality but are still good to fetch, for example for displaying info on an LCD (like in the following parts of this project). Note: the Content-Type tag will say audio/mpeg for MP3 and audio/aacp for AAC.
Parsing the header is quite simple. The re-entrant strtok_r function can be used to first split the received data into lines, and then another call can be used to split each line into the tag and the payload. For testing these kinds of things I like to use the C Playground. I just dump some example data from the device and try to parse it in the C Playground. It really speeds up the development process when I only need to test one piece of code and don't need to recompile the whole project and reprogram the ESP32. At least I'm not good enough to write such code correctly on the first try.
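Here is a rough sketch of that two-level strtok_r split, the kind of thing that is easy to try on a PC first. The icy_header_t struct and parse_icy_header are names invented for this example, and the field sizes are arbitrary assumptions.

```c
#define _POSIX_C_SOURCE 200809L  /* for strtok_r and strcasecmp on strict compilers */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Hypothetical container for the tags this project cares about. */
typedef struct {
    char name[64];
    char genre[32];
    int  bitrate;   /* kbit/s, from icy-br */
    int  metaint;   /* audio bytes between metadata blocks, 0 if absent */
} icy_header_t;

/* Parse a NUL-terminated header block in place (strtok_r modifies it). */
void parse_icy_header(char *header, icy_header_t *out)
{
    char *line_state = NULL;
    memset(out, 0, sizeof *out);

    for (char *line = strtok_r(header, "\r\n", &line_state);
         line != NULL;
         line = strtok_r(NULL, "\r\n", &line_state)) {
        char *tag_state = NULL;
        char *tag = strtok_r(line, ":", &tag_state);
        /* An empty delimiter set returns the rest of the line, so values
         * that contain ':' themselves (URLs, for example) stay intact. */
        char *value = strtok_r(NULL, "", &tag_state);
        if (tag == NULL || value == NULL)
            continue;               /* status line or a tag without a value */
        while (*value == ' ')
            value++;                /* accept "tag: value" and "tag:value" */

        if (strcasecmp(tag, "icy-name") == 0)
            snprintf(out->name, sizeof out->name, "%s", value);
        else if (strcasecmp(tag, "icy-genre") == 0)
            snprintf(out->genre, sizeof out->genre, "%s", value);
        else if (strcasecmp(tag, "icy-br") == 0)
            out->bitrate = atoi(value);
        else if (strcasecmp(tag, "icy-metaint") == 0)
            out->metaint = atoi(value);
    }
}
```

The tag comparison is case-insensitive because servers do not agree on capitalisation. Note that the header buffer is modified during parsing, so it cannot be reused as-is afterwards.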
Speaking of testing, all the data can also be easily fetched on a PC using Unix tools. This could be done for example using wget as described here. I've personally also used curl. Both of these commands have to be terminated manually or they will keep loading data indefinitely. There is also a nice Unix tool called xxd that can be used to view hex data. Combined with less, xxd should be enough to inspect the stream data. Both of the commands also request the meta-data so that all that is written in this post can be easily verified.
wget --header="Icy-MetaData:1" -S -O reply.txt [url]
Step 3, Parse metadata
Technically audio data comes after the header, but in my opinion this step fits here better. Reading the metadata requires first waiting for "metaint" amount of bytes of audio data. After that, one byte should be read from the stream and multiplied by 16. The result is the amount of bytes that should be read as metadata (and not transferred to the decoder chip). After reading these bytes, the software should continue buffering audio data. An example of metadata is shown below.
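The counting described above can be sketched as a small byte-by-byte state machine. This is only an illustration under my own assumptions: icy_demux_t and the function names are invented, and the metadata buffer is sized for the worst case of 255 x 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define META_MAX (255 * 16)   /* one length byte, so at most 4080 bytes */

typedef struct {
    size_t metaint;         /* icy-metaint from the header, 0 = no metadata */
    size_t audio_left;      /* audio bytes until the next length byte */
    size_t meta_left;       /* metadata bytes still to collect */
    size_t meta_len;        /* metadata bytes collected so far */
    int    expect_len;      /* 1 when the next byte is the length byte */
    char   meta[META_MAX + 1];
} icy_demux_t;

void icy_demux_init(icy_demux_t *d, size_t metaint)
{
    d->metaint = metaint;
    d->audio_left = metaint;
    d->meta_left = 0;
    d->meta_len = 0;
    d->expect_len = 0;
    d->meta[0] = '\0';
}

/* Feed one stream byte. Returns 1 if the byte is audio (forward it to the
 * buffer), 0 if it was consumed as metadata. When a metadata block has just
 * been completed, *meta_ready is set to 1 and d->meta holds the string. */
int icy_demux_byte(icy_demux_t *d, uint8_t byte, int *meta_ready)
{
    *meta_ready = 0;

    if (d->meta_left > 0) {                 /* inside a metadata block */
        d->meta[d->meta_len++] = (char)byte;
        if (--d->meta_left == 0) {
            d->meta[d->meta_len] = '\0';
            *meta_ready = 1;
            d->audio_left = d->metaint;
        }
        return 0;
    }
    if (d->expect_len) {                    /* the length byte itself */
        d->expect_len = 0;
        d->meta_left = (size_t)byte * 16;
        d->meta_len = 0;
        if (d->meta_left == 0)
            d->audio_left = d->metaint;     /* no metadata this time */
        return 0;
    }
    /* plain audio byte */
    if (d->metaint > 0 && --d->audio_left == 0)
        d->expect_len = 1;
    return 1;
}
```

A length byte of zero is common (the title has not changed), in which case the next audio chunk starts immediately.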
StreamTitle='title of the song';
Parsing this syntax is quite annoying since the ' character can appear in the middle of the string. However I'm not sure if the ; character can appear in the payload. If it's not allowed, the whole string could be first split by the ; character and then = character, after which the first and the last character could be discarded.
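One possible way around the apostrophe problem, sketched below, is to look for the two-character sequence '; as the terminator instead of a lone quote. This is a variation on the splitting idea above, not something taken from a specification, and it would still be fooled by a title that itself contains '; so treat it as an assumption. The function name is invented.

```c
#include <stddef.h>
#include <string.h>

/* Copy the StreamTitle value from a metadata string into out.
 * Returns 1 on success, 0 if the field is missing or malformed. */
int extract_stream_title(const char *meta, char *out, size_t out_size)
{
    static const char key[] = "StreamTitle='";
    const char *start = strstr(meta, key);
    if (start == NULL || out_size == 0)
        return 0;
    start += sizeof key - 1;

    /* Look for the closing quote followed by ';' rather than a lone quote,
     * so an apostrophe inside the title does not end it early. */
    const char *end = strstr(start, "';");
    if (end == NULL)
        return 0;

    size_t len = (size_t)(end - start);
    if (len >= out_size)
        len = out_size - 1;         /* truncate to fit the output buffer */
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```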
Another issue here, at least at this level of programming, is UTF-8. Since we are making a simple embedded system, we do not want to support the 1 112 064 different characters that the strings may contain. For this reason there should be at least some check that simply replaces the characters that our device cannot display with an underscore or some other character. I suppose for European radio stations it's enough to support Latin characters and the same with all kinds of hats (like åäöáàâ etc).
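As a sketch of such a check, the function below converts UTF-8 to single-byte Latin-1 and writes an underscore for everything else. That the display uses a Latin-1 character table is purely my assumption for the example, and the function name is invented. It is not a strict UTF-8 validator, it only tries not to print garbage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-8 text to single-byte Latin-1, writing '_' for every character
 * that does not fit or is malformed. out must hold at least strlen(in) + 1
 * bytes; the output is never longer than the input. */
void utf8_to_latin1(const char *in, char *out)
{
    const uint8_t *s = (const uint8_t *)in;

    while (*s) {
        uint32_t cp;
        int extra;                        /* continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; } /* stray byte */
        bool multibyte = extra > 0;
        s++;

        while (extra > 0 && (*s & 0xC0) == 0x80) {
            cp = (cp << 6) | (*s & 0x3F);
            s++;
            extra--;
        }
        if (extra > 0 || (multibyte && cp < 0x80))
            cp = 0xFFFD;                  /* truncated, or overlong ASCII */

        bool fits = cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF);
        *out++ = fits ? (char)(uint8_t)cp : '_';
    }
    *out = '\0';
}
```

With this, "Björk" keeps its ö as the single byte 0xF6, while something like the euro sign becomes an underscore.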
Step 4, Buffer audio data
Since there might be rather long latencies in the WiFi/Internet connection, some audio data should be buffered before forwarding it to the decoder chip. A simple circular buffer should suffice in this case. This source states that a 300 ms buffer should be enough; however, I made a much longer buffer since there is plenty of memory in the ESP32.
Making a circular buffer is easy: it's just a large enough array with a pointer that wraps around to the beginning of the array after reaching the end. There should actually be two pointers, one for writing and one for reading. Special care should be taken so that the pointers don't pass each other, which would cause an ugly glitch in the resulting audio. It took me some time to implement this because of the way these two pointers are incremented.
The write pointer is incremented by whatever number of bytes the receive function manages to receive from the Internet. That cannot really be specified, although I could make a function that writes only up to a 32-byte boundary in the buffer. The read pointer, however, is always incremented by 32, because the datasheet specifies that at least 32 bytes can be written when DREQ goes high. For this reason I've implemented it in such a way that it transmits exactly 32 bytes. This poses an issue: we cannot simply check whether the read pointer is equal to the write pointer, because there's a 31/32 chance that the read pointer will skip over the write pointer and we will have a glitch. For this reason I've decided to make the check like so: read_ptr / 32 == write_ptr / 32. If this statement is true, the buffer is "empty" and no audio data should be sent to the decoder (since there is no new data). Technically that is an error, since the internal buffer of the decoder chip is quite small.
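The block-wise comparison is easier to trust after trying a few values. Below is a small sketch written as pure functions so it can be tested on a PC; AUDIO_BUFFER_SIZE of 4096 is an assumed value (any multiple of 32 works) and the function names are made up.

```c
#include <stdbool.h>
#include <stddef.h>

#define AUDIO_BUFFER_SIZE 4096   /* assumed; must be a multiple of CHUNK */
#define CHUNK 32                 /* bytes sent to the decoder at a time */

/* Bytes waiting between the read and write positions, wrap-around included. */
size_t buffer_level(size_t read_ptr, size_t write_ptr)
{
    return (write_ptr + AUDIO_BUFFER_SIZE - read_ptr) % AUDIO_BUFFER_SIZE;
}

/* Block-wise comparison: the writer is still inside the 32-byte block the
 * reader would send next, so there is nothing complete to send yet. */
bool buffer_is_empty(size_t read_ptr, size_t write_ptr)
{
    return read_ptr / CHUNK == write_ptr / CHUNK;
}
```

For example, with the read pointer at 0 and the write pointer at 31 the buffer counts as empty, and it stops being empty the moment the write pointer reaches 32.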
Originally I expected the buffer never to overflow, since technically the decoder should consume the data at the same rate that the server sends it. Moreover I assumed that the client (my device) should buffer some data first before sending it forward to the decoder. However that is apparently not quite so. It seems that the server sends data much faster right after opening the connection. I suppose this is so that the client can both fill the buffer and start playing immediately. This caused buffer overflows in my device, so I had to throttle the reception, because there's simply not enough memory for the amount of data some servers send. Making this functionality was easy, since the incoming data is processed byte by byte. The code was reduced to a simple "while((write_ptr + 1) % AUDIO_BUFFER_SIZE == read_ptr);", which waits for as long as the next write position is the one the reader is at. Reading is implemented in an interrupt and will not be prevented by this loop. This check should however be done before incrementing the write pointer so that the device will not think that the buffer is empty.
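A minimal sketch of the throttled write, assuming the usual ring buffer convention that the buffer is full when advancing the write pointer would land on the read pointer. Instead of spinning, this version returns false so it can be tested off-target; on the device the caller would simply retry until the interrupt has freed some space. The buffer size and function names are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AUDIO_BUFFER_SIZE 4096   /* assumed size */

static uint8_t audio_buffer[AUDIO_BUFFER_SIZE];
static volatile size_t read_ptr;    /* advanced by the interrupt handler */
static volatile size_t write_ptr;   /* advanced by the receive code */

/* Full means the next write position is where the reader currently is.
 * One slot is sacrificed so that "full" and "empty" stay distinct. */
bool buffer_is_full(void)
{
    return (write_ptr + 1) % AUDIO_BUFFER_SIZE == read_ptr;
}

/* Store one byte unless the buffer is full. The check comes before the
 * increment, so the write pointer can never catch up with the read pointer. */
bool buffer_try_write(uint8_t byte)
{
    if (buffer_is_full())
        return false;
    audio_buffer[write_ptr] = byte;
    write_ptr = (write_ptr + 1) % AUDIO_BUFFER_SIZE;
    return true;
}
```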
Step 5, Transfer audio data to the decoder chip
As already mentioned, the easiest implementation is to transmit 32 bytes whenever the DREQ signal of the VS1053b goes high. However there might be large delays when receiving data from the Internet and the internal buffer of the decoder chip might not be enough for such a time. For this reason I've used a timer that calls an interrupt handler. This interrupt handler will then transmit audio data to the chip whenever the DREQ is high. This will interrupt any other ongoing tasks like the data processing, however it will not interrupt anything critical since those tasks should run on the other core and cannot be interrupted by the user program (AFAIK).
Making the transfer itself is quite simple. First the code should check whether DREQ is high. Then it should check whether there are at least 32 bytes of data in the buffer. If both checks pass, 32 bytes should be transmitted, and the read pointer should be incremented by 32 and wrapped around if necessary.
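These checks can be sketched as one small function that the timer interrupt would call. To keep it testable without hardware, the DREQ level is passed in and the function only hands back a pointer to the 32 bytes to send; reading the pin and doing the SPI transfer are left to the caller. The struct, the names and the buffer size are assumptions for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AUDIO_BUFFER_SIZE 4096   /* assumed; must be a multiple of CHUNK */
#define CHUNK 32

typedef struct {
    uint8_t data[AUDIO_BUFFER_SIZE];
    volatile size_t read_ptr;
    volatile size_t write_ptr;
} audio_ring_t;

/* Returns a pointer to the next 32 bytes to send to the decoder, or NULL if
 * nothing should be sent right now. The read pointer is advanced and wrapped
 * here, so the caller must send the bytes before the receive code runs again
 * (which holds when this is called from an interrupt handler). */
const uint8_t *next_chunk(audio_ring_t *rb, bool dreq_high)
{
    if (!dreq_high)
        return NULL;                                    /* decoder is busy */
    if (rb->read_ptr / CHUNK == rb->write_ptr / CHUNK)
        return NULL;                                    /* under 32 bytes buffered */

    const uint8_t *chunk = &rb->data[rb->read_ptr];
    rb->read_ptr = (rb->read_ptr + CHUNK) % AUDIO_BUFFER_SIZE;
    return chunk;
}
```

Since the buffer size is a multiple of 32 and the read pointer only moves in steps of 32, a chunk never straddles the end of the array.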
Result
The result of this post is a completely hard-coded ESP32 Web Radio player that works mostly. It successfully connects to unsecured HTTP servers and streams data. It can also successfully strip and decode the header and metadata and print the results via serial line. The next step is to add a display, some inputs and make it configurable.
It feels amazing that less than 500 lines of code (excluding the WiFi and Internet connection code) is enough to make a hard-coded Internet radio player. And most of this code is written from scratch by me. However the length of the code might grow drastically as soon as the display is added and the "hardcodeness" is removed. :)
Final words
This project is progressing quite rapidly. However I will need a breakout board to be able to prototype with the display that I want to use. And speaking of which, I only have one such display and more will be in stock in ... April apparently. I am not joking. This project might thus take quite a long time to be fully functional. I just hope I can have the first prototype PCB before my next vacation. :)
The most unfortunate thing about this project is that I started it using the oldest ESP32 module that I had, and there have been at least two new chip revisions since then. I don't really know if this affects anything in a meaningful way though. Additionally I hope that this device will work in the location that it is designed for, since I cannot often test it there. It might require an external WiFi antenna, but fortunately ESP32 modules with an external antenna are available.
Spoiler: I've been researching the VS1053b chip and there seem to be all kinds of features, including bass/treble control and plugin support that allows both an equalizer and a spectrum analyzer. With this it would be easy to make some kind of audio visualisation on a display or using external LEDs. In short, this decoder chip is very capable and I'm very interested in researching some of the available features.