mips_r4300i commented on Memory Mapping an FPGA from an STM32   serd.es/2024/07/24/Memory... · Posted by u/hasheddan
mystified5016 · a year ago
What the hell is going on at ST? Every STM uC I've tried to use in the past few years has had showstopper bugs, with loads of very similar complaints online dating back to the release of the part. Bugs that have been in the wild for years still exist in the current production run.

After burning enough company time chasing bugs through ST's crappy silicon, I've had to just swear them off entirely. We're an Atmel house now. Significantly fewer (zero) problems, and some pretty nifty features like UPDI.

mips_r4300i · a year ago
They churn out new parts and don't bring fixes back in. See all the chips in their lineup that have a USB host controller: every one of them (they use Synopsys IP) will fail with multiple low-speed (LS) devices behind a hub. We talked to our FAE about this and they have no plans to fix it. The bug has existed for years, and the bad IP is still being baked into all the new chips. The solution? Use yet another chip for its host controller, and don't use a hub.
mips_r4300i commented on Memory Mapping an FPGA from an STM32   serd.es/2024/07/24/Memory... · Posted by u/hasheddan
dmitrygr · a year ago
Be veeeery careful. The STM32H7 QSPI peripheral is FULL OF very nasty bugs, especially the second version (the one that supports writes) that you find in STM32H7B chips. You are currently avoiding them by having QSPI mapped as device memory, but the minute you attempt to use it with cache or run code from it, or (god help you) put your stack, heap, and/or vector table on a QSPI device, you are in for a world of poorly debuggable 1-in-1,000,000 failures. ST knows but refuses to publicly acknowledge it, even while privately admitting some other customers have "hit similar issues". Issues I've found, demonstrated to them, and written reliable reproductions for:

* non-4-byte-sized writes are randomly lost, about 1 in 1,000,000 writes, if QSPI is writeable and not cached

* non-4-byte-sized writes are randomly rounded up in size to 2 or 4 bytes with garbage, overwriting nearby data about 1 in 1,000,000 writes, if QSPI is writeable and cached

* when PC, SP, and VTOR all point to QSPI memory, any interrupt has about a 1-in-1,000,000 chance of reading garbage instead of the proper vector from the vector table if it interrupts an LDM/STM instruction targeting the QSPI memory and the access is cached and misses the cache

Some of these have workarounds that I found (contact me). I am refusing to disclose them to ST until they acknowledge the bugs publicly.

I recommend NOT using STM32H7 chips in any product where you want QSPI memory to work properly.
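For what it's worth, one generic defensive pattern (a sketch only, and emphatically not the undisclosed workarounds mentioned above; it just assumes aligned 32-bit accesses are the safe case) is to funnel every store to the memory-mapped window through aligned 4-byte read-modify-writes:

```c
#include <stdint.h>
#include <stddef.h>

/* Write len bytes into a memory-mapped window using only aligned,
 * 4-byte loads and stores (read-modify-write for partial words).
 * Assumes a little-endian target, as on Cortex-M. */
static void write_bytes_word_aligned(volatile uint32_t *win, size_t offset,
                                     const uint8_t *src, size_t len)
{
    while (len > 0) {
        size_t word    = offset / 4;        /* which 32-bit word          */
        unsigned shift = (offset % 4) * 8;  /* byte lane within the word  */
        uint32_t v = win[word];             /* aligned 32-bit read        */
        v &= ~(UINT32_C(0xFF) << shift);    /* clear the target byte lane */
        v |= (uint32_t)*src << shift;       /* merge in the new byte      */
        win[word] = v;                      /* aligned 32-bit write       */
        ++src; ++offset; --len;
    }
}
```

On a real part you would still have to think about cache maintenance and write ordering; this only sidesteps the non-4-byte-store failure modes described above.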

mips_r4300i · a year ago
Thanks for the heads-up. I have a design at the fab that uses the H7's OctoSPI, so this concerns me. I steered away from the memory-mapped mode because it seemed too good to be true; I wanted to be able to qsort() and put heaps in this extra space.

I suspect ST only ever tested it with the single PSRAM they intend this mode for. My plan is to use indirect mode and manually poke the peripheral, though DMA will still have to happen.

Back on the PIC32MX platform there was a similar type of bug that, as far as I can tell, nobody else has hit: if any interrupt fires while the PMP peripheral is doing a DMA, there is a 1-in-a-million chance that it will silently drop 1 byte. I noticed this because all my accesses were 32-bit (4 bytes) and broke horribly at the misalignment. The solution is to disable all interrupts while doing DMA.
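That workaround can be wrapped so it's hard to forget. A minimal sketch, with made-up stand-ins for the platform's real interrupt intrinsics (on a real PIC32 toolchain these would be compiler builtins; the names here are hypothetical):

```c
/* Hypothetical stand-ins for platform intrinsics; the counters exist
 * only so the bracketing order can be checked. */
static int seq = 0, irq_off_at = -1, xfer_at = -1, irq_on_at = -1;

static unsigned irq_save(void)          { irq_off_at = seq++; return 1u; }
static void     irq_restore(unsigned s) { (void)s; irq_on_at = seq++; }
static void     pmp_dma_xfer(void)      { xfer_at = seq++; } /* start + busy-wait */

/* The workaround: mask interrupts around the whole DMA transfer,
 * so no ISR can fire mid-transfer and drop a byte. */
static void dma_transfer_atomic(void)
{
    unsigned state = irq_save();  /* interrupts off              */
    pmp_dma_xfer();               /* DMA runs without preemption */
    irq_restore(state);           /* interrupts back on          */
}

/* Returns 1 if the masking bracketed the transfer correctly. */
static int ordering_ok(void)
{
    dma_transfer_atomic();
    return irq_off_at == 0 && xfer_at == 1 && irq_on_at == 2;
}
```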

mips_r4300i commented on FuryGpu – Custom PCIe FPGA GPU   furygpu.com... · Posted by u/argulane
0xcde4c3db · a year ago
I've been told by several people that distributor pricing for FPGAs is ridiculously inflated compared to what direct customers pay. Considering that one can apparently get a dev board on AliExpress for about $110 [1] while Digikey lists the FPGA alone for about $1880 [2], I believe it. (This example isn't an UltraScale chip, but it is significantly bigger than the usual low-end Zynq 7000 boards sold to undergrads and tinkerers.)

[1] https://www.aliexpress.us/item/3256806069467487.html

[2] https://www.digikey.com/en/products/detail/amd/XC7K325T-1FFG...

mips_r4300i · a year ago
This is both true and false. While I work with Intel/Altera, Xilinx is basically the same.

That devboard is 100 percent using recycled chips. Their cost is almost nothing.

The Kintex-7 part in question can probably be bought in volume quantities for around $190. Think 100k EAU.

This kind of price break comes with volume and is common with many other kinds of silicon besides FPGAs. Some product lines have more pricing pressure than others. For example, very popular MCUs may not get as wide of a price break. Some manufacturers price more fairly to distributors, some allow very large discounts.

mips_r4300i commented on FuryGpu – Custom PCIe FPGA GPU   furygpu.com... · Posted by u/argulane
snvzz · a year ago
Pipeline seems retro, but far better than nothing.

There's no open-hardware GPU to speak of. Depending on the license (I can't find any information?), this could be the first, and a starting point for more.

mips_r4300i · a year ago
Number Nine's Ticket to Ride is a fixed-function GPU from the late 90s that was completely open-sourced under the GPL.
mips_r4300i commented on Sega Saturn Architecture – A practical analysis (2021)   copetti.org/writings/cons... · Posted by u/StefanBatory
33985868 · a year ago
There were many microcode versions and variants released over the years. IIRC one of the official figures was ~180k tri/sec.

I could draw a ~167,600 tri opaque model with all features (shaded, lit by three directional lights plus an ambient one, textured, Z-buffered, anti-aliased, one cycle), plus some large debug overlays (anti-aliased wireframes for text, 3D axes, Blender-style grid, almost fullscreen transparent planes & 32-vert rings) at 2 FPS/~424 ms per frame at 640x476@32bpp, 3 FPS/~331ms at 320x240@32bpp, 3 FPS/~309ms at 320x240@16bpp.

That'd be somewhere between roughly 400k and 540k tri/sec. Sounds weird, right? But that's extrapolated straight from the CPU counter on real hardware plus eyeballing, so it's hard to argue with.
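The arithmetic checks out; a quick sanity pass over the extrapolation, taking the triangle count and frame times above as given:

```c
/* 167,600 triangles per frame, divided by the measured frame time. */
static double tris_per_sec(double frame_ms)
{
    return 167600.0 / (frame_ms / 1000.0);
}
/* 424 ms -> ~395k tri/s, 331 ms -> ~506k, 309 ms -> ~542k:
 * right around the quoted 400k-540k bracket. */
```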

I assume the bottleneck at that point is the RSP processing all the geometry; a lot of the triangles will be backface-culled, and because of the sheer density at such a low resolution, most of them will be drawn in no time by the RDP. Or, y'know, the bandwidth. Haven't measured, sorry.

Performance depends on many variables, one of which is how well the asset converter itself can optimise the draw calls. The one I used, a slight variant of objn64, prefers duplicating vertices just so it can fully load the cache in one DMA command (gSPVertex) while also maximising gSP2Triangle commands, IIRC (check the source if curious). But there are no doubt many other ways of efficiently loading and drawing meshes, not to mention all the ways you could batch the scene graph for things more complex than a demo.
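As a toy illustration of that trade-off (hypothetical code, not the actual converter): a greedy packer that fills each cache load with as many triangles as fit, duplicating any vertex that straddles a batch boundary. Sharing vertices within a batch cuts the number of loads; duplication across batches is the price:

```c
#include <string.h>
#include <stddef.h>

#define CACHE_SIZE 32   /* vertex cache entries per gSPVertex-style load */
#define MAX_VERTS  4096 /* toy bound on mesh size */

typedef struct { int v[3]; } tri;

/* Count how many cache loads ("batches") a greedy packer needs.
 * Vertices already in the current batch are reused; anything else
 * is (re)loaded, i.e. duplicated if it was in an earlier batch. */
static size_t count_batches(const tri *t, size_t ntris)
{
    static int in_batch[MAX_VERTS];
    size_t batches = 0, used = 0;
    memset(in_batch, 0, sizeof in_batch);

    for (size_t i = 0; i < ntris; i++) {
        size_t need = 0;
        for (int k = 0; k < 3; k++)
            if (!in_batch[t[i].v[k]]) need++;
        if (used + need > CACHE_SIZE) {   /* flush: start a new cache load */
            batches++;
            used = 0;
            memset(in_batch, 0, sizeof in_batch);
        }
        for (int k = 0; k < 3; k++)
            if (!in_batch[t[i].v[k]]) { in_batch[t[i].v[k]] = 1; used++; }
    }
    return batches + (used > 0 ? 1 : 0);
}
```

With all-unique vertices a 32-entry cache takes 10 triangles per load, while a triangle strip packs roughly 30 per load, which is the kind of win the larger 64-entry cache amplifies.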

Anyway, the particular result above was with the low-precision F3DEX2 microcode (gspF3DLX2_Rej_fifo): it doubles the vertex cache size in DMEM from 32 to 64 entries but removes the clipping code, so polygons too close to the camera get trivially rejected. The other side effect with objn64 is that the larger vertex cache massively reduces the memory footprint (far less duplication): it might've shaved something like 1 MB off the 4 MB of compiled data.

Compared to the full precision F3DEX2, my comment said: `~1.25x faster. ~1.4x faster when maxing out the vertex cache.`.

All the microcodes I used have a 16 KB FIFO command buffer held in RDRAM (as opposed to the RSP's DMEM for XBUS microcodes). It goes like this if memory serves right:

1. CPU starts RSP graphics task with a given microcode and display list to interpret from RAM

2. RSP DMAs display list from RAM to DMEM and interprets it

3. RSP generates RDP commands into a FIFO in either RDRAM or DMEM

4. When output command buffer is full, it waits for the RDP to be ready and then asks it to execute the command buffer

5. The RDP reads the 64-bit commands via either RDRAM or the cross-bus (the 128-bit internal bus connecting the two), which avoids RDRAM bus contention.

6. Once the RDP is done, go to step 2/3.
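Purely as an illustrative model (nothing like real RSP microcode), the handoff in steps 3 through 6 behaves like a batch producer/consumer, where the producer drains the buffer to the consumer whenever it fills:

```c
#include <stdint.h>
#include <stddef.h>

#define FIFO_CMDS 4  /* toy capacity; the real FIFO held 16 KB of 64-bit commands */

typedef struct {
    uint64_t buf[FIFO_CMDS];
    size_t   count;   /* commands currently buffered        */
    size_t   drains;  /* times the "RDP" was asked to drain */
} cmd_fifo;

/* Steps 4/5: wait for the RDP, hand it the batch; the buffer empties. */
static void rdp_execute(cmd_fifo *f)
{
    f->count = 0;
    f->drains++;
}

/* Step 3: the "RSP" appends a command, draining first when full. */
static void rsp_emit(cmd_fifo *f, uint64_t cmd)
{
    if (f->count == FIFO_CMDS)
        rdp_execute(f);
    f->buf[f->count++] = cmd;
}
```

The manual's XBUS-vs-FIFO caveat below is exactly about how small that buffer is: a tiny capacity means the producer stalls on the consumer far more often.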

To quote the manual:

> The size of the internal buffer used for passing RDP commands is smaller with the XBUS microcode than with the normal FIFO microcode (around 1 Kbyte). As a result, when large OBJECTS (that take time for RDP graphics processing) are continuously rendered, the internal buffer fills up and the RSP halts until the internal buffer becomes free again. This creates a bottleneck and can also slow RSP calculations. Additionally, audio processing by the RSP cannot proceed in parallel with the RDP's graphics processing. Nevertheless, because I/O to RDRAM is smaller than with FIFO (around 1/2), this might be an effective way to counteract CPU/RDP slowdowns caused by competition on the RDRAM bus. So when using the XBUS microcode, please test a variety of combinations.

mips_r4300i · a year ago
I'm glad someone found objn64 useful :) Looking back, it could've been optimized better, but it was Good Enough when I wrote it. I think someone added PNG texture support at some point. I was going to add CI8 conversion but never got around to it.

On the subject of XBUS vs FIFO, I trialled both in a demo I wrote with a variety of loads. Benchmarking revealed that over a 3-minute run, the difference between the two methods was a second or less. So in all my time messing with them, I never found XBUS to help with contention. I'm sure in some specific application it might be a bit better than FIFO. By the way, I used a 64 KB FIFO, which is huge; I don't know if that gave me better results.

mips_r4300i commented on Sega Saturn Architecture – A practical analysis (2021)   copetti.org/writings/cons... · Posted by u/StefanBatory
mrguyorama · a year ago
The other big problem with the N64 was that the RAM had such high latency that it completely undid any benefit from RDRAM's supposedly higher bandwidth, and the console was constantly memory-starved.

The RDP could rasterize hundreds of thousands of triangles a second but as soon as you put any texture or shading on them, the memory accesses slowed you right down. UMA plus high latency memory was the wrong move.

In fact, in many situations you can "de-optimize" the rendering to draw and redraw more, as long as it uses less memory bandwidth, and end up with a higher FPS in your game.

mips_r4300i · a year ago
That's mostly correct. It is as you say, except that shading and texturing come for free. You may be thinking of the PlayStation, where you do indeed lose fill rate when texturing is on.

Now, if you enable 2-cycle mode, the pipeline recycles the pixel value back into the pipeline for a second stage, which is used for two texture lookups per pixel and some extra blending options. Otherwise, the RDP always outputs 1 pixel per clock at 62.5 MHz (though it will be frequently stalled by RAM contention). There are faster drawing modes, but they are for drawing rectangles, not triangles. It's been a long time since I've benchmarked the pipeline, though.
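Back-of-the-envelope, taking that peak 1-pixel-per-clock figure as given (actual throughput is lower once RAM contention and Z-buffer traffic kick in):

```c
/* Peak RDP fill rate at 62.5 MHz, versus what a framebuffer demands
 * (ignoring overdraw and memory contention). */
static double peak_fill(int two_cycle)
{
    double clk = 62.5e6;                 /* 1 pixel per clock in 1-cycle mode */
    return two_cycle ? clk / 2.0 : clk;  /* 2-cycle mode halves pixel rate    */
}

static double fill_needed(double w, double h, double fps)
{
    return w * h * fps;  /* pixels per second to repaint the screen */
}
```

On paper there is huge headroom at 320x240; the comment above is about why the paper number rarely survives contact with RDRAM.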

You're exactly right that the UMA plus high-latency RAM murders it. It really does. Enable the Z-buffer? Now the poor RDP is thrashing read-modify-writes, and you only get 8-pixel chunks at a time. Span caching is minimal. Simply using the Z-buffer will torpedo your effective fill rate by 20 to 40 percent. That's why stuff I wrote for it avoided the Z-buffer whenever possible.

The other bandwidth hog was anti-aliasing. AA processing happens in two places: first in the triangle-drawing pipeline, for interior polygon edges; second in the VI when the framebuffer is displayed, which applies smoothing to exterior polygon edges based on coverage information stored in each pixel's extra bits.

On average, you get roughly a 15 to 20 percent fill-rate boost by turning both of those off. If you run only at low-res, it's a bit less, since more of your render time is occupied by triangle setup.

mips_r4300i commented on Efinix Titanium Ti375 FPGA offers quad-core hardened RISC-V, PCIe Gen 4, 10GbE   cnx-software.com/2024/03/... · Posted by u/mikhael
jecel · a year ago
It is interesting that the MIPI, PCIe and Serdes columns in the table of device models show "-" for all of them. The same is true for the other Titanium family: nothing is mentioned about Serdes in the datasheet, only in the text description and the table for the whole family.
mips_r4300i · a year ago
The other Efinix chips are still sold as "has SerDes", yet there is not a single mention of it in the datasheets. At first I thought it was because they were still going through silicon qualification, but it's been 18 months and they're still TBD.
mips_r4300i · a year ago
Forgot to add: over that time frame they DID add MIPI documentation, so they've got that working.
mips_r4300i commented on JITX – The Fastest Way to Design Circuit Boards   jitx.com... · Posted by u/Teever
mips_r4300i · a year ago
Good idea and concept; I fully support it.

Being able to specify a generic part requirement instead of hunting for a specific part is nice sometimes, but any company that makes more than a couple of boards already has this covered with a BOM management system.

Adding parts via text is nice and fast, but it also glosses over many aspects of a part. Say I use a Diodes Inc buck regulator. It has a valid input-voltage range it will accept. It has multiple ways it can be wired depending on the application: wired for buck, SEPIC, etc.; PFM on/auto/off, etc. I don't see control over details like that.

Your About Us page is longer and more extensive than your actual product example page.

I see 4-5 very basic designs that I could bang out in Altium in under a day each. Are you selling to people unable to make PCBs? I look for ways to save time because I wear many hats at my job, only one of which is doing a board. However, I would not save any time using this tool, because it would produce an inferior result. Additionally, after only 8 months of paying for this product, someone could instead afford a full Altium license.

I want to save time on stuff like breaking out and pin swapping IO on a large FPGA. Handle DDR3 routing for me. These are things that actually take time, because you need to understand the device and read through tons of PDFs. However, I think that might also be the most difficult part to add to your product.

Finally, how does it handle physical constraints like non-square board outlines, mounting-hole placement, and 3D STEP integration?

mips_r4300i commented on RP2040 Boot Sequence   vanhunteradams.com/Pico/B... · Posted by u/vmoore
crote · a year ago
For those wondering about why there's both a boot ROM and the boot2 in flash:

The flash chips used support both a basic SPI mode and an advanced QSPI mode. There is a well-defined standard protocol for basic SPI mode, so virtually all chips will respond to the same read command for simple, slow, byte-by-byte reading. The only thing left to try is the four SPI modes (does the clock idle high or low? do we sample data on the leading or the trailing clock edge?); hardware often even supports two of them, and there's only one combination that actually makes sense.

QSPI, on the other hand, is more of a wild west. You need to run a bunch of chip-specific commands to enter QSPI mode, and there are quite a few possible variations of QSPI read commands, not to mention a lot of different timing requirements. Trying all of them isn't really possible, hence the chip-specific boot2 segment.

Staying in SPI mode isn't really viable either, because the application code is stored in the flash chip. To give an example, a jump to a random instruction would incur a 1280 ns read with a W25Q80BW flash chip operating in SPI mode (realistically 10x that, due to a lower safe clock frequency), whereas QSPI mode can reliably do it in as little as 125 ns. With the RP2040 running at 133 MHz, a 16-cycle delay for a random jump or a read from a data block is not too bad, but a 170- or even 1700-cycle delay is just way too much.
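The cycle math above works out as quoted (taking 133 MHz and the two read latencies as given):

```c
/* Convert a flash read latency into RP2040 CPU cycles at 133 MHz. */
static double latency_cycles(double ns)
{
    double cycle_ns = 1e9 / 133e6;  /* ~7.52 ns per CPU cycle */
    return ns / cycle_ns;
}
/* 125 ns QSPI read  -> ~16.6 cycles
 * 1280 ns SPI read  -> ~170 cycles (x10 clock derating -> ~1700) */
```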

mips_r4300i · a year ago
Great info, and now some chips are supporting Octal SPI, which is even more vendor-dependent. At some point we're basically back to parallel flash...
