ESP8266 Emulation

From Uzebox Wiki

This is a design document that explains the inner workings of the ESP8266 emulation system in Uzem. Partly this is a brain dump, but primarily it serves to gather my thoughts on what has proven to be more complex than originally anticipated. The earliest version used a method I knew would not be fast, but it emulated enough for a simple program to enter a chatroom and communicate with another instance of Uzem and see its output, which left this PC, went to New York, and came back to each emulated Uzebox. There was really no need for a proof of concept, since sockets programming is a well documented subject; it was more for inspiration and for testing to determine a new method.

Design Goals

The general design goals were based on some thinking about the inherent nature of each device: the Uzebox and the ESP8266. The Uzebox must be emulated at full speed, or else sound buffers will run dry, game logic will run slower, the player will perceive unresponsive controls, and video frame updates won't happen at the same rate as on the native hardware. This heavily distorts the user's perception of the emulation. Because Uzem is geared towards development, where it needs to be very accurate, it does not take shortcuts that would take advantage of common code techniques. In the future, programs could push the hardware in new ways that require precise timing; whatever works on hardware must work the same in the emulator.

The approach to the ESP8266 emulation is very different from the Uzebox core. Instead of striving for cycle perfect emulation, a high level approach is taken. Here we take advantage of the fact that the module need only produce correct output for the input from the Uzebox. How this happens does not matter so long as UART commands, which happen very slowly compared to machine cycles, do what they are supposed to. Many things about the timing of a UART response are inherently unpredictable. The module has many things going on: interrupts, analog elements of the radio transmission hardware, network collisions, internet latency, etc. You simply cannot write a reliable program on real hardware that relies on cycle perfect responses from the module, because things will not happen the same way twice. Since programs must already be timing tolerant, this high level of abstraction lets us emulate the module with good speed. A perfect emulation of this powerful hardware would still require programs to be just as timing tolerant, and would be significantly more expensive than current PCs can handle. After all, the ESP8266 itself has faster and more complex circuitry than the rest of the Uzebox combined.

It seemed most reasonable to make the Uzebox side the fastest if there are tasks that cannot be divided evenly. This is because even a more powerful chipset is less cycle intensive to emulate when you do not require cycle perfect accuracy. The Uzebox side must run at full speed. It is possible, during some computer lockup or the like, that the ESP8266 side will not run fast enough to meet some protocol timing, such as responding quickly to an IRC PING from a remote machine (which of course is not lockstepped with our emulation), where failure to meet the timing would cause a disconnect. On real hardware this would simply not happen, because the ESP8266 will mostly never stall for a considerable period, except if there was an issue somewhere on the internet path. So this should be rare but, like the internet, there are potential times when things temporarily break timing due to a complex chain of events beyond the control of software. In actuality, the author believes Uzem network performance exceeds the real thing.


UART Emulation

It was important in the second revision to allow for accurate UART timing, which the first version did not have. This is because we need to be able to develop code that correctly deals with baud rates, instead of code that only works on Uzem. It is important to understand how the UART hardware works at the block level. Essentially the ATmega644 clock is fed into the UART circuitry, and this clock rate is reduced by a divisor. Software can write to specific memory locations to change the value of this divisor, and so change the speed, or baud rate, at which the UART operates for both reading and writing. For data to be transferred between devices correctly, the baud rate timing on both ends must be the same within a small percentage of error. The emulation of course has 0% timing error on this.
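As a rough illustration of the divisor relationship described above, here is a minimal sketch. It is not Uzem's code. It assumes the AVR datasheet formula for asynchronous mode (16 clocks per bit, or 8 in double speed mode, times the divisor register value plus one), an assumed Uzebox clock of 28.63636 MHz, and an assumed 8N1 frame of 10 bits per byte. All names are made up for the example.

```python
# Assumed values for illustration only.
F_CPU = 28_636_360      # Hz, assumed Uzebox system clock
BITS_PER_FRAME = 10     # assumed 8N1 frame: 1 start + 8 data + 1 stop bit


def cycles_per_bit(divisor: int, double_speed: bool = False) -> int:
    """CPU cycles that pass for each UART bit at a given divisor value."""
    samples = 8 if double_speed else 16
    return samples * (divisor + 1)


def baud_rate(divisor: int, double_speed: bool = False) -> float:
    """Resulting baud rate in bits per second."""
    return F_CPU / cycles_per_bit(divisor, double_speed)


def cycles_per_byte(divisor: int, double_speed: bool = False) -> int:
    """CPU cycles the UART stays busy with one byte (one whole frame)."""
    return BITS_PER_FRAME * cycles_per_bit(divisor, double_speed)
```

Under these assumptions, a divisor of 61 in double speed mode lands within 1% of 57600 baud and keeps the UART busy for 4960 cycles per byte, which is the kind of number the emulated ready flags have to wait out.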

The divisor by nature means that many machine cycles pass before the circuitry is done transferring a byte. In the same sense, it takes the same amount of time for a new byte to be received after a previous one, since the other device needs that long to send each byte at this baud rate. Naturally the higher the baud rate, the lower the divisor, and thus the fewer machine cycles per byte and the higher the overall speed of transfer. The limitation is that transfers should not happen so fast that the program cannot keep up; otherwise data is lost.

The program needs a way to check if the transmit circuitry is ready to send another byte, and also to check if new data has arrived so it can quickly read it before newer data overwrites it (resulting in lost data). To emulate this, every time a new byte is sent to the transmit hardware, the "Ready to transmit a new byte" flag is unset and a counter is set to the number of cycles it will take before the ready flag is set again. Likewise, whenever a new byte is received, the received flag is unset and will not be set again until at least the correct number of cycles have elapsed, and of course then only if there is new data. The important aspects that general software cares about are emulated in this way, and it is hoped any program that works in Uzem will work on hardware and vice versa.
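The flag and counter mechanism above can be sketched roughly as follows. This is a simplified model, not Uzem's actual code: the class and method names are hypothetical, the byte delay is passed in as a plain number, and a write while the transmitter is busy is simply rejected.

```python
class EmulatedUart:
    """Toy model of the Tx ready / Rx received flags with cycle countdowns."""

    def __init__(self, cycles_per_byte: int):
        self.cycles_per_byte = cycles_per_byte
        self.tx_ready = True      # "ready to transmit a new byte" flag
        self.tx_countdown = 0
        self.rx_ready = False     # "new byte received" flag
        self.rx_countdown = 0
        self.rx_data = None
        self.incoming = []        # bytes the other device has queued for us
        self.sent = []            # bytes we have handed to the transmit hardware

    def write(self, byte: int) -> bool:
        """Program writes a byte; returns False if the hardware is still busy."""
        if not self.tx_ready:
            return False
        self.sent.append(byte)
        self.tx_ready = False                      # busy "sending"
        self.tx_countdown = self.cycles_per_byte   # cycles until ready again
        return True

    def read(self) -> int:
        """Program reads the received byte; the next one needs a full byte time."""
        self.rx_ready = False
        self.rx_countdown = self.cycles_per_byte
        return self.rx_data

    def tick(self, cycles: int = 1) -> None:
        """Advance the emulated UART by a number of machine cycles."""
        if not self.tx_ready:
            self.tx_countdown = max(0, self.tx_countdown - cycles)
            if self.tx_countdown == 0:
                self.tx_ready = True
        if not self.rx_ready:
            self.rx_countdown = max(0, self.rx_countdown - cycles)
            # Only raise the flag if the delay has passed AND data exists.
            if self.rx_countdown == 0 and self.incoming:
                self.rx_data = self.incoming.pop(0)
                self.rx_ready = True
```

A program polling `tx_ready` or `rx_ready` in this model sees the flags change only after a full byte time has elapsed, which is the behavior general software depends on.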


Speed Issues

The first iteration was very slow, and there was no hope my PC would be able to run even 1/4 of full emulation speed. This was because the ESP8266 emulation update function was called for every few cycles that ran. As said, I knew this was too slow, but it was the easiest way to initially get anything working and correctly synchronized. While coding this, I was aware the ultimate solution would involve a significant amount of code that Uzem, already a cycle intensive program, would need to communicate with while still maintaining full speed. Because even a fairly fast core (as of 2016) does not have a lot of free cycles left after emulating the Uzebox cycle perfect at real time speed, the idea was to have the operating system help. Since everything these days seems to have 4+ cores, it would be ideal to put the load on another core that doesn't have to do the primary system emulation. Also, while a thread is blocked waiting on something (frame buffer or audio buffer lock, etc.), other threads can run in these otherwise wasted cycles. Because there is generally no good way to force a thread to run on a different core, the best that can be done is to make the threads themselves as efficient as possible and let the OS make it happen fast. Immediately this raises the problem that is by far the most complex part: thread race conditions.


Because the emulated device uses UART to send and receive data, there must be shared memory where each thread puts its output, and at some point the other thread must have access to this output. Sharing memory between threads means either thread could modify data in between consecutive instructions of the other thread, so protection against this needs to be built in at the base so that all reads and writes are guaranteed to be atomic (by cooperation). Since these checks are made often they need to be fast; having locks around every memory access would be slower than the naive method first used. There are many, many different ways of doing this, including high level constructs like mutexes. Because performance needs to be very high, a general approach was not used, and instead research went into a solution that is best for this specific requirement. The most generic solutions, a common multithreading technique (because this crap is a headache!!!), lead to threads entirely stalling while awaiting their turn to access data, which can stall the other thread if rapid I/O is happening. Mutexes themselves generally work well, but the author believed a better system could exist that keeps both threads running at full speed as much as possible and stalls a thread only if absolutely necessary. And if it is necessary, then better to stall the ESP8266 and let the Uzebox run free.

In the current solution, neither side will ever stall except in situations where a thread stops getting updated due to an overloaded processor, OS issues, or what have you. In this case the Uzebox side is slowed down to continue accurate emulation, even if at less than real time, until the stalled thread starts running again. Note that this method produces reliable UART behavior for the Uzebox program, but if such a stall lasts too long during time sensitive network tasks, some machine you are interacting with (which of course runs in real time regardless of emulation issues on your end) might disconnect you or similar. Normally this should not happen, but of course the key to a good experience is to have a machine fast enough to run at full speed!

UART Memory Contention

Because the basic idea is to avoid both threads needing to access memory at the same time, a method was created that allows the access to mostly happen when it is convenient. The output of a calculation needs to be put somewhere, even if it is not the ultimate destination, so that computation can continue instead of stalling. Two buffers are used for each direction of communication so that in most cases there is always a place to put the data and move on. This unfortunately complicates things quite a bit. Complexity arises from the need for accurate UART timing. UART timing needs to be emulated because a received byte has to sit in the emulated UART circuitry for a while to get picked up by the kernel code. Also, a program may spend periods of time with UART disabled. If timing is not considered, a program that enables UART 10 minutes later would receive a stream from 10 minutes in the past, instead of that data being long gone like on real hardware. If character hold time is not used, chars will overwrite each other faster than the Uzebox program/kernel could ever hope to read them.


Double Buffers

To always have a place to put things requires 2 buffers, as mentioned previously. Putting a string like "HELLO WORLD!" with some chars in one buffer and others in the alternate leaves broken portions of the string in each, and requires logic and another buffer to remember what came at what time. This is even more complex, and these calculations on the Uzebox side are run millions of times a second, so it's a crap solution. Instead we take advantage of the one-sided nature of each UART line. The Uzebox Transmit (or Tx) line only ever sends data to the ESP8266 and never receives. This is because the Uzebox Tx is connected to the read only Receive (or Rx) port on the ESP8266. Likewise, the Uzebox Rx only ever reads the data the ESP8266 sends and never puts any data on the line. So these are 2 one directional pipes that form a bidirectional channel.

From the Uzebox's point of view, it wants to store the data the program has sent, unset the UART Tx ready flag (since the emulated UART hardware is now busy "sending" the data), then set a cycle delay timer that runs until the UART Tx ready flag is set again. This keeps realistic timing that matches the hardware, instead of being able to transfer a byte in a single machine cycle. The same is done for the Rx flags, and in fact both devices are emulated in this way. The amount of delay between ready states is calculated from the flags currently set for the baud rate. When these flags do not match between the two devices, the emulation simply scrambles characters to simulate the gibberish received when 2 devices are talking at different baud rates. This is not at all accurate, since the way the 2 different timings interact with transistor edge triggering reaches a point where analog effects are critical. If you wanted to be really accurate you would need a perfect Quantum Mechanics emulator (if you can write that, please help humanity in that way instead). Luckily it does not matter much, since if the baud rates are wrong the data is unusable by the program/game. Perhaps some specialized, really smart baud rate detector could use the errors to determine if the baud rate is higher or lower, but let's cross that bridge when we get there! Also, no emulation of UART framing errors is done, because it would be as complex as the rest of the ESP8266 core for no gain. UART in Uzem is an idealized form where physical communication errors cannot happen... unless a subatomic particle flips a bit in your RAM.
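The baud mismatch handling could look something like the sketch below. The text does not say how the scrambling is done, so replacing the byte with a random value is purely an assumption here, as are the function and parameter names.

```python
import random


def deliver(byte: int, tx_divisor: int, rx_divisor: int, rng=random) -> int:
    """Return the byte the receiver sees for one transmitted byte.

    Assumption: a mismatch is modeled as a random byte. This only stands in
    for "gibberish"; it does not model real bit timing or framing errors.
    """
    if tx_divisor == rx_divisor:
        return byte               # matching baud settings: data arrives intact
    return rng.randrange(256)     # mismatched settings: unusable garbage
```

Passing in a seeded `random.Random` instance as `rng` makes the garbage repeatable, which may be handy when debugging a program's baud handling.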

The advantage of the double buffers is that of the 2 buffers for each of Tx and Rx (4 buffers total), one buffer, or side, of each is owned by each thread. Now there is less immediate competition, and we can take advantage of the slow transfer rates of UART, which avoids most lock conditions. We do not need to keep these buffers in sync at the cycle rate of the Uzebox core; not even close. When the ESP8266 has something like internet packet data, it will write something like "+IPD,11:HELLO WORLD" to the first buffer, which eventually makes it to the Uzebox Rx side. The Uzebox never reads directly from this buffer and so does not need to stall the module emulation. Instead, when the Uzebox has exhausted its side of the buffer, or indeed before it ever declares a new UART char to the 644 core, it checks to see if there is an active request for the ESP8266 to transfer bytes from its side to the Uzebox side. If there is, it reports no data, since the module is or will be modifying the Uzebox side of the buffer. The Uzebox is the only one who will set this flag. The ESP8266 is the only side that will unset the flag, to indicate it is done with the requested transfer. Only when this flag is not set, and it is therefore known the ESP8266 will not modify the Uzebox side of the buffer until a subsequent request, can the Uzebox declare a new char to the 644 core. The ESP8266 can never transfer data across buffers without being requested to do so, and likewise the Uzebox can never get new data without first requesting it. Avoiding race conditions in this way is critical. More thought might indicate that exotic triple buffers or other mechanisms would work even better, maybe, but the author has rewritten, theorized with pen and paper, and thought about this more times than he would like to recall! I highly doubt there is any method that yields significantly higher performance, and if there is, it would probably be terribly hard to maintain.
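To make the ownership rules concrete, here is a minimal single-process model of the ESP8266 to Uzebox direction. It is a sketch under assumptions, not Uzem's implementation: the names are invented, and the two "threads" are represented by methods called in turn. Real threaded code would additionally need the flag accesses to be atomic and properly ordered relative to the buffer accesses.

```python
class RxChannel:
    """ESP8266 -> Uzebox pipe: one buffer per thread plus one request flag."""

    def __init__(self):
        self.esp_side = bytearray()  # written only by the ESP8266 thread
        self.uze_side = bytearray()  # Uzebox's, except while request is set
        self.request = False         # set ONLY by Uzebox, cleared ONLY by ESP8266

    # --- called from the ESP8266 thread ---
    def esp_write(self, data: bytes) -> None:
        """Module output always has somewhere to go, so it never stalls."""
        self.esp_side += data

    def esp_service(self) -> None:
        """Move data across only when the Uzebox has asked for it."""
        if self.request:
            self.uze_side += self.esp_side
            self.esp_side.clear()
            self.request = False     # last step: hand the buffer back

    # --- called from the Uzebox thread ---
    def uze_next_byte(self):
        """Return the next received byte, or None for 'no data right now'."""
        if self.request:
            return None              # module is or will be touching our side
        if self.uze_side:
            return self.uze_side.pop(0)
        self.request = True          # our side is exhausted: ask for more
        return None
```

Because each flag transition has exactly one writer, neither side ever needs to take a lock; the worst case is that the Uzebox sees "no data" for a little longer, which real UART timing tolerates anyway.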

The Uzebox gets its output to the ESP8266 side in a similar manner. Here the ESP8266 can set a flag requesting that the Uzebox side transfer its data to the module's side. This is different, however, in that the module will not immediately set this flag whenever it is out of data. Instead it will only request this after a certain amount of time has passed since the last request. This lets the Uzebox run free as much as possible, since it takes many cycles even at the fastest possible baud rate for new data to be sent. We do not want to "over poll", which gains nothing and drastically reduces performance by making the Uzebox side rapidly halt to send 1 byte. Since, as explained earlier, there is no effort to create a "cycle perfect" ESP8266 emulator, which is impossible, it is acceptable that the timing between bytes can differ. Indeed the real thing can be interrupted during transfers, causing inter character delay. Other than that difference it is essentially the same process. After the ESP8266 sets the flag to request a reconciliation of the buffers (done by the Uzebox side), it simply reports no new UART data until the Uzebox unsets this flag, indicating it is done with the transfer. Again, only the ESP8266 will set this flag and only the Uzebox will unset it, mirroring the way ESP8266 to Uzebox transfers work. As long as the flag is set, the ESP8266 sees no new data but is free, for instance, to receive a new packet that just came from the internet. As long as both of these methods are obeyed, data will never get out of order, be overwritten, or otherwise be incorrect and cause erroneous emulation.
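The mirrored direction, with its throttled requests, might be modeled like this. Again this is only a sketch with invented names; in particular, measuring "time since the last request" in ESP8266 loop iterations is an assumption made to keep the example small.

```python
class TxChannel:
    """Uzebox -> ESP8266 pipe with rate limited transfer requests."""

    def __init__(self, min_polls_between_requests: int):
        self.uze_side = bytearray()  # written only by the Uzebox thread
        self.esp_side = bytearray()  # ESP8266's, except while request is set
        self.request = False         # set ONLY by ESP8266, cleared ONLY by Uzebox
        self.min_gap = min_polls_between_requests
        self.polls_since_request = min_polls_between_requests

    # --- called from the Uzebox thread ---
    def uze_write(self, byte: int) -> None:
        self.uze_side.append(byte)

    def uze_service(self) -> None:
        """Reconcile the buffers, but only when the module has asked."""
        if self.request:
            self.esp_side += self.uze_side
            self.uze_side.clear()
            self.request = False

    # --- called from the ESP8266 thread, once per loop iteration ---
    def esp_poll(self):
        """Return the next byte from the Uzebox, or None for 'no new data'."""
        self.polls_since_request += 1
        if self.request:
            return None              # transfer pending: report no data
        if self.esp_side:
            return self.esp_side.pop(0)
        # Out of data: only bother the Uzebox if enough time has passed.
        if self.polls_since_request >= self.min_gap:
            self.request = True
            self.polls_since_request = 0
        return None
```

The gap value trades latency for Uzebox speed: a larger gap means fewer interruptions of the Uzebox thread, at the cost of the module noticing new bytes a little later.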

Temporal Decoherence (Thread Stalls)

Another detail that we must be ready for is threads stalling or running far behind another one. In this case we want the program on the Uzebox to have realistic operation like it would on the real hardware where the ESP8266 does not have random thread stalls. To protect against this, there is no option but to stop the program from executing if the ESP8266 thread has not updated for a long period of time. This is because the program may be waiting a finite amount of time for a response to determine if the module is present and operating correctly. If the real module will always make it in this time period, then the emulated one must also; even if we have to stop time on the program to make it happen.

This is done through a simple mechanism. Whenever the ESP8266 emulation completes a cycle, it sets a Uzebox side counter to 0. Whenever the Uzebox completes an emulated instruction, it increments this counter and compares it to a "fluff value", which is the tolerance we allow for threads to be out of synchronization. If the counter is too high, it means the ESP8266 thread has not run for too long a period of time, so we wait for it to reset the counter to 0, indicating it has updated again. The nature of the ESP8266 emulation is that it does everything it possibly can with the given UART input (completed commands) or internet data (received from other machines), then loops. Since the real UART input and output, with all the hardware and complex effects of the ESP8266, are variable anyway, a certain amount of drift is tolerable. Setting this value too low would result in constant locking of the Uzebox thread, which is slow; setting it too high would keep the lock from ever happening but could result in unrealistic program operation on the Uzebox side. There is a direct connection between the current UART baud rate and how high this value can be while still being realistic. During normal operation, when your PC is not loaded down too much, this wait should never happen.
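The counter mechanism above is small enough to sketch in a few lines. The names and the default fluff value here are hypothetical, and the actual waiting (sleeping or spinning until the other thread resets the counter) is left out.

```python
class SyncGuard:
    """Detects when the ESP8266 thread has fallen too far behind the Uzebox."""

    def __init__(self, fluff: int = 10_000):   # hypothetical tolerance value
        self.fluff = fluff
        self.counter = 0

    def esp_cycle_done(self) -> None:
        """ESP8266 thread: one full pass of the module emulation finished."""
        self.counter = 0

    def uzebox_instruction_done(self) -> None:
        """Uzebox thread: one emulated instruction finished."""
        self.counter += 1

    def uzebox_must_wait(self) -> bool:
        """True if the Uzebox should pause until the ESP8266 catches up."""
        return self.counter > self.fluff
```

In the Uzebox loop, a check of `uzebox_must_wait()` after each instruction would gate execution; as long as the ESP8266 thread keeps looping, the counter never gets near the tolerance and the check costs almost nothing.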


Conclusion

Here is a bit of history that might help explain how happy I am that this is working. The second attempt, after the first naive way, used a concept of rolling pointers for each single buffer, where it was made sure that reading never happened past what was written. Whenever the program read a byte from the UART hardware it was popped off the buffer, and likewise for transmitting. This worked, and I made several small experiments accessing resources and communicating with other machines on the internet. Then I ran into a serious problem when I went past trivial experiments!


Phew, in writing it all sounds manageable, and it is, but after so many experiments and trying more exotic things, it was almost a relief to find that more complex methods have other pitfalls and actually don't improve performance over this less complex way. It is orders of magnitude faster than the original naive way.