jasonni commented on WIP: Nvidia Parakeet ASR mode inference in GGML   github.com/jason-ni/parak... · Posted by u/jasonni
jasonni · 24 days ago
I'm working on implementing inference for Nvidia's Parakeet TDT ASR model in the GGML framework. The performance compared to the MLX Python version surprised me: my GGML implementation is 1000x slower. Any help/comments/suggestions are welcome. Thanks a lot!
jasonni commented on Sohu: The First Transformer ASIC   etched.com/... · Posted by u/HCazlab
Zaheer · a year ago
The results sort of speak for themselves - custom ASIC's are the way of the future. How hard is it though for Nvidia to design a custom ASIC like this?
jasonni · a year ago
No one is sure that the Transformer is the final, best architecture. However, you can still use an RTX 3090 or RTX 2090 to run AI models today, whether the network is an LSTM, RNN, or Transformer. Programmability and compatibility have real economic value.

I hope that in the future, when chip manufacturing cost is no longer the bottleneck of AI, we will have more options.

jasonni commented on Sohu: The First Transformer ASIC   etched.com/... · Posted by u/HCazlab
mysterEFrank · a year ago
Do they bake in the actual weights or the architecture? If it's just the architecture I don't understand where a speedup that considerable can come from.
jasonni · a year ago
From the "Isn’t inference bottlenecked on memory bandwidth, not compute?" section of their announcement, it seems the weights are still kept in memory. The chip may have limited on-chip cache for computation. Input tokens go through a batched pipeline to relieve the memory bottleneck, similar to Groq.
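A back-of-envelope sketch of why a batched pipeline relieves the memory bottleneck: during decode, one full pass over the weights can serve every request in the batch, so the per-token bandwidth cost falls as 1/batch. The model size and bandwidth below are illustrative assumptions, not Sohu or Groq figures.

```python
def tokens_per_second(weight_bytes, bandwidth_bytes_per_s, batch_size):
    """Upper bound on decode throughput when weight reads dominate:
    each full pass over the weights yields one token per batched request."""
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * batch_size

# Hypothetical 70B-parameter model in fp16 (~140 GB of weights)
# served from 3 TB/s of memory bandwidth:
single = tokens_per_second(140e9, 3e12, batch_size=1)    # ~21 tokens/s
batched = tokens_per_second(140e9, 3e12, batch_size=64)  # ~1371 tokens/s
```

Under this simple model the hardware is bandwidth-bound at batch 1 no matter how much compute it has, which is why both Groq-style pipelining and large batches help.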
jasonni commented on Sohu: The First Transformer ASIC   etched.com/... · Posted by u/HCazlab
ted_dunning · a year ago
ASIC just stands for Application Specific Integrated Circuit. So, yeah, it is like an FPGA, but it takes longer to turn around a new version because you have to wait for somebody to etch you some silicon, though you may get higher density than with the FPGA. You can do (very) small volumes at old densities for cheap, but if you are trying to track the front of the technology wave with commercially viable shipping quantities, you often need tens of millions of dollars per generation. This means that these folks have room for 1-3 generations before their money is gone.

LLMs that are attached to normal CPUs need lots of fast memory because they are doing very large matrix operations with very few arithmetic units which implies a lot of data motion. Changing that architecture might save on the need to move so much data, but it isn't at all clear what these people are proposing.

It also isn't at all obvious why their stuff would be any better than an ordinary vectorized arithmetic unit (often provocatively called a "tensor" chip).

jasonni · a year ago
On their announcement page, the section "How can we fit so much more FLOPS on our chip than GPUs?" gives some details. It says "only 3.3% of the transistors on an H100 GPU are used for matrix multiplication". They trade off programmability for computation density. And from the "Isn’t inference bottlenecked on memory bandwidth, not compute?" section, I guess they use tricks similar to Groq's. Looking forward to more architecture details and a comparison with Groq.
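The "lots of data motion, few arithmetic units" point from the parent comment can be made concrete with arithmetic intensity: a batch-1 matrix-vector product performs only about 1 FLOP per byte of fp16 weights read, while batching reuses each loaded weight across the whole batch. The layer shape below is an arbitrary illustration, not any vendor's numbers.

```python
def arithmetic_intensity(m, k, batch, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for an (m x k) fp16 layer."""
    flops = 2 * m * k * batch               # one multiply-add per weight per request
    weight_bytes = m * k * bytes_per_weight
    return flops / weight_bytes

print(arithmetic_intensity(4096, 4096, batch=1))   # 1.0 FLOP/byte: bandwidth-bound
print(arithmetic_intensity(4096, 4096, batch=64))  # 64.0: compute can start to dominate
```

With fp16 weights the intensity equals the batch size, so extra matrix-multiply transistors only pay off once the batch (or an on-chip pipeline) raises reuse enough to escape the bandwidth ceiling.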
jasonni commented on Flameshot – Open-source screenshot software   flameshot.org/... · Posted by u/nikolay
noisy_boy · a year ago
Great idea - I ended up experimenting to improve the ocr accuracy:

    #!/bin/bash

    screenshot=$(mktemp)
    decoded_data=$(mktemp)
    processed_data=$(mktemp)

    cleanup() {
        rm "$screenshot" "$decoded_data" "$processed_data"
    }

    trap cleanup EXIT

    flameshot gui -s -r > "$screenshot"

    convert "$screenshot" \
        -colorspace Gray \
        -scale 1191x2000 \
        -unsharp 6.8x2.69+0 \
        -resize 500% \
        "$screenshot"

    tesseract \
        --dpi 300 \
        --oem 1 "$screenshot" - > "$decoded_data"

    grep -v '^\s*$' "$decoded_data" > "$processed_data"

    xclip -selection clipboard < "$processed_data"

    yad --text-info --title="Decoded Data" \
        --width=940 \
        --height=580 \
        --wrap \
        --fontname="Iosevka 14" \
        --editable \
        --filename="$processed_data"

jasonni · a year ago
Last year, when I wanted a tool for the screenshot-and-OCR job, I found Flameshot. However, the OCR feature hadn't been added as a native function, due to some issues I'm not fully clear on.

So I spent some time adding an OCR function to Flameshot. I chose not to compile Tesseract into Flameshot, but instead to call a remotely running server over a REST API. The reason is that I also added a llama.cpp translation feature after OCR.

Here are the GitHub repositories for my fork of flameshot and for the OCR-and-translation server, which is casually written in Rust.

https://github.com/jason-ni/flameshot
https://github.com/jason-ni/flameshot-ocr-server
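The REST-call approach can be sketched as a tiny client. The `/ocr` route, host, and raw-bytes payload here are assumptions for illustration only, not the actual API of flameshot-ocr-server.

```python
import urllib.request

def build_ocr_request(server_url, image_bytes):
    """Build a POST request carrying screenshot bytes to a remote OCR server.
    The /ocr path and raw-bytes body are hypothetical, not the real API."""
    return urllib.request.Request(
        url=f"{server_url}/ocr",
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

# Sending it with urllib.request.urlopen(req) would return the server's
# response body (the recognized, and optionally translated, text).
req = build_ocr_request("http://127.0.0.1:8080", b"\x89PNG")
```

Keeping OCR behind a network endpoint like this is what lets the heavy dependencies (Tesseract, llama.cpp) live on a separate machine instead of inside the screenshot tool.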

jasonni commented on Show HN: Software for Remote GPU-over-IP   github.com/Juice-Labs/Jui... · Posted by u/stevegolik
jasonni · 3 years ago
Glad to see a competitor to https://virtaitech.com/en/index. As far as I know, VirtAI doesn't provide freeware, but they do provide RDMA networking and GPU pooling features. For anyone interested in how this is done, I suggest having a look at https://github.com/ut-osa/gpunet and https://github.com/tkestack/vcuda-controller
