Readit News
onalark commented on After a $14B Upgrade, New Orleans’ Levees Are Sinking   scientificamerican.com/ar... · Posted by u/selimthegrim
jingalings · 7 years ago
As a non-American, can someone explain to me why the army was responsible for this work? When does the normal tender/contractor process not occur? Is it a scale or political decision?
onalark · 7 years ago
It's complicated. The Army Corps of Engineers has had a civilian mandate to support flood control since 1917 [1]. Beyond that, they are also involved in large public works projects such as the building of roads and bridges, and Superfund clean-up sites. On top of this, they regularly receive large pork-barrel grants from Congress that siphon money into a senator's state or a congressperson's district. They do have a large contracting arm and are actually pretty well regarded for their comprehensive procurement and management process for these large public works projects.

So it's scale, politics, and history/momentum at this point.

[1] https://en.wikipedia.org/wiki/U.S._Army_Corps_of_Engineers_c...

onalark commented on Use perfect hashing, instead of binary search, for keyword lookup   postgresql.org/message-id... · Posted by u/boshomi
nly · 7 years ago
Some guy called Ilan Schnell wrote a Python code generator that uses the very same algorithm. It comes with language templates for Python, C, and C++ (and is easy to extend to others), plus a dot-file generator so you can visualize the graph:

http://ilan.schnell-web.net/prog/perfect-hash/

He also has a superb illustrated explanation of the algorithm, a unique property of which is you get to choose the hash values:

http://ilan.schnell-web.net/prog/perfect-hash/algo.html

I've been using it for years.

onalark · 7 years ago
Ilan Schnell is not "some guy". He's the original primary author of the Anaconda distribution, one of the main reasons so many data scientists use Python.
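For the curious, the "you get to choose the hash values" property comes from the CHM-style construction the package implements: each key becomes an edge in a random graph, and if the graph is acyclic you can assign vertex values so every key hashes to exactly the index you picked. A rough self-contained sketch (my own toy code, not Ilan's generator):

```python
import random

class _Retry(Exception):
    pass

def build_chm(keys, tries=500):
    # Each key k becomes an edge (f1(k), f2(k)) of a graph on m vertices.
    # If the graph is acyclic, vertex values g[] can be assigned so that
    # (g[f1(k)] + g[f2(k)]) % n equals whatever index we chose for k.
    n, m = len(keys), 2 * len(keys) + 1
    for _ in range(tries):
        s1, s2 = random.random(), random.random()
        f1 = lambda k, s=s1: hash((s, k)) % m
        f2 = lambda k, s=s2: hash((s, k)) % m
        try:
            adj = {v: [] for v in range(m)}
            for i, k in enumerate(keys):
                u, v = f1(k), f2(k)
                if u == v:
                    raise _Retry  # self-loop: pick new hash functions
                adj[u].append((v, i))
                adj[v].append((u, i))
            g = [None] * m
            for start in range(m):
                if g[start] is not None:
                    continue
                g[start] = 0
                stack, seen = [start], set()
                while stack:
                    u = stack.pop()
                    for v, i in adj[u]:
                        if i in seen:
                            continue
                        seen.add(i)
                        if g[v] is None:
                            g[v] = (i - g[u]) % n
                            stack.append(v)
                        elif (g[u] + g[v]) % n != i:
                            raise _Retry  # cycle with conflicting values
            return g, f1, f2
        except _Retry:
            continue
    raise RuntimeError("no acyclic graph found")

def lookup(key, g, f1, f2, n):
    # Two hash evaluations, two table reads, one add: O(1), no collisions.
    return (g[f1(key)] + g[f2(key)]) % n

keywords = ["select", "insert", "update", "delete", "from", "where", "join", "group"]
g, f1, f2 = build_chm(keywords)
```

Every key lands at its own chosen index, which is exactly what makes it a drop-in replacement for binary search over a keyword table.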
onalark commented on Grumpy: Go running Python   opensource.googleblog.com... · Posted by u/trotterdylan
genericpseudo · 9 years ago
By volume numpy is mostly assembler written to the Fortran ABI (it's a LAPACK/BLAS-etc wrapper).
onalark · 9 years ago
NumPy is a library that provides typed multidimensional arrays and functions that operate on them. It ships with a built-in LAPACK/BLAS fallback and can link against an external LAPACK/BLAS, but that's a by-product of providing typed arrays and nowhere near the central purpose of the library.

Also, NumPy is implemented completely in C and Python, and makes extensive use of CPython extension hooks and knowledge of the CPython reference counting implementation, which is part of the reason why it is so hard to port to other implementations of Python.
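To make the distinction concrete (assuming a standard NumPy install):

```python
import numpy as np

# A typed, contiguous 2-D array: one machine dtype for every element,
# stored in a single C buffer rather than as boxed Python objects.
a = np.arange(6, dtype=np.float64).reshape(2, 3)

# Elementwise ufuncs run as compiled C loops over that buffer...
b = a * 2.0 + 1.0

# ...while np.dot is the part that dispatches to BLAS (built-in or external).
c = a.dot(a.T)  # 2x2 result
```

The typed arrays and ufuncs are the core; the BLAS hookup only matters for the linear-algebra calls.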

onalark commented on Correcting Intel's Deep Learning Benchmark Mistakes   blogs.nvidia.com/blog/201... · Posted by u/Smerity
iamleppert · 10 years ago
Why don't they provide a link to their testing methodology? They need to back up their claims (on both sides) with the actual configuration, all versions, and sample datasets for people to independently verify.

A docker container that runs their performance suite would be ideal.

onalark · 10 years ago
Except that Docker containers play terribly with virtualization solutions. Still, some sort of configuration/infrastructure-as-code would go a long way.
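Even a pinned image definition would capture most of what matters; something like this sketch (the image tag, framework version, and paths are hypothetical placeholders, not a known-working setup):

```dockerfile
# Hypothetical benchmark image: pin every version that affects the numbers.
FROM nvidia/cuda:8.0-cudnn5-devel
RUN apt-get update && apt-get install -y python-pip
RUN pip install tensorflow-gpu==0.11.0
COPY bench/ /bench/
COPY data/ /data/
CMD ["python", "/bench/run_suite.py", "--dataset", "/data/sample"]
```

Anyone disputing the numbers could then rebuild the exact environment instead of guessing at library versions.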
onalark commented on Cache-Efficient Functional Algorithms (2014) [pdf]   cs.cmu.edu/~rwh/papers/io... · Posted by u/ingve
onalark · 10 years ago
This is a really interesting article and I'm glad this is getting attention. It's especially refreshing to see a theoretical treatment bridging algorithms with more "modern" hardware implementations.

One gap I do find in this article is the lack of real-world performance tests and an assessment of how much the constants matter in performance tuning. Frigo, Leiserson, Prokop, and Ramachandran did some really interesting work in 1999 on cache-oblivious algorithms, with applications to kernels such as the FFT and matrix multiplication. The work is theoretically very interesting, but in practice hand-written or machine-generated cache-aware kernels continue to dominate. The most famous example of this is probably Kazushige Goto, whose hand-optimized GotoBLAS (now open source as OpenBLAS) still provides some of the fastest linear algebra kernels in the world.

If you're interested in learning more about how the differences between the two approaches shake out in linear algebra, I recommend "Is Cache-Oblivious DGEMM Viable?" by Gunnels et al.
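The cache-oblivious idea itself is compact: recurse on subproblems with no machine-specific block size, so that at some depth every subproblem fits in every level of cache. A toy Python sketch for square matrices whose size is a power of two (my own illustration, not code from the paper):

```python
def matmul(A, B, C, ri, rj, rk, n, base=2):
    # Accumulate the n x n block product A[ri:ri+n, rk:rk+n] x
    # B[rk:rk+n, rj:rj+n] into C[ri:ri+n, rj:rj+n].
    if n <= base:
        # Tiny base case: an ordinary triple loop.
        for i in range(ri, ri + n):
            for k in range(rk, rk + n):
                a = A[i][k]
                for j in range(rj, rj + n):
                    C[i][j] += a * B[k][j]
    else:
        # Split each matrix into four quadrants and recurse on the eight
        # block products -- no tuned tile size anywhere.
        h = n // 2
        for di in (0, h):
            for dj in (0, h):
                for dk in (0, h):
                    matmul(A, B, C, ri + di, rj + dj, rk + dk, h, base)
```

The hand-tuned BLAS approach replaces this blind recursion with tiles sized to the actual L1/L2/TLB of the target machine, which is where the constant factors are won.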

onalark commented on Scientist: Measure Twice, Cut Over Once   githubengineering.com/sci... · Posted by u/jesseplusplus
noobiemcfoob · 10 years ago
Overall, I love this type of approach. We've begun doing something similar at work as well.

However, I don't get the restriction on code with side effects.

Would it not be possible to introduce another abstraction layer around those side effects to allow comparison between the old code's side effects and the refactor's code side effects?

onalark · 10 years ago
I don't think this would work, for a number of reasons. If it's a database you're modifying, a lot of operations (increment, delete, etc.) will do the wrong thing if they're called twice. And if the operations themselves are idempotent, you still couldn't verify which code path produced the intended side effect. This is one reason developers spend so much time building mock objects: to capture "side effects".
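In Python terms, the mock-object approach looks something like this (the database API here is invented for illustration):

```python
from unittest.mock import MagicMock

# A non-idempotent side effect: running old and new code against the same
# real store would increment twice.  Instead, each path gets its own mock,
# and we compare the *recorded* calls rather than the real effect.
def old_path(db):
    db.increment("visits", 1)

def new_path(db):
    db.increment("visits", amount=1)  # the refactor changed the call shape

old_db, new_db = MagicMock(), MagicMock()
old_path(old_db)
new_path(new_db)

# The mocks captured the side effects; comparing them reveals the mismatch
# without ever touching a database.
same = old_db.method_calls == new_db.method_calls
```

That's the work Scientist sidesteps by restricting itself to side-effect-free code.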
onalark commented on Scientist: Measure Twice, Cut Over Once   githubengineering.com/sci... · Posted by u/jesseplusplus
onalark · 10 years ago
Awesome, I'm a huge fan of new and innovative tools that help improve the process of refactoring existing code. This looks like a really promising tool for Ruby developers, and I'm always grateful when companies and their employees invest the time and effort to release their tools to the community. I really liked the point about "buggy data" as opposed to just buggy code; I think that's an important distinction.

A few reactions from reading through the release:

Scientist appears to be largely limited to situations where the code has no "side effects". I think this is a pretty big caveat, and it would have been helpful in the introduction/summary to see this mentioned. Similarly, I think it would be nice to point out that Scientist is a Ruby-only framework :)

You don't mention "regression test" at any point in the article, which is the language I'm most familiar with for this sort of testing. How does a Scientist "experiment" compare to a regression test over that block of code?
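As I understand the pattern, the difference is that an experiment runs on live production inputs and records mismatches instead of failing a build. A hypothetical Python analogue (not Scientist's actual Ruby API):

```python
mismatches = []

def experiment(old, new, *args):
    # Run both code paths, compare results, always return the old one, so
    # users never see the candidate's behavior.
    control = old(*args)
    try:
        candidate = new(*args)
        if candidate != control:
            mismatches.append((args, control, candidate))
    except Exception as exc:
        mismatches.append((args, control, exc))
    return control

def legacy_double(x):
    return x * 2

def refactored_double(x):
    return 7 if x == 3 else x + x  # seeded bug for illustration

results = [experiment(legacy_double, refactored_double, x) for x in range(5)]
# results == [0, 2, 4, 6, 8]; the x == 3 discrepancy is only recorded
```

A regression test would assert on fixed inputs once; this keeps the old path authoritative while the new one is audited against real traffic.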

Anyway, thanks again for writing this up, I'll be thinking more about the Experiment testing pattern for my own projects.

onalark commented on Quiver: Programmer's Notebook for OS X   happenapps.com/#quiver... · Posted by u/moonlighter
onalark · 10 years ago
Shiny, and I love the interface/layout!

This looks a lot like the Jupyter/IPython Notebook, which is a free and open source "scientist's notebook". If you're interested in mixing LaTeX, Markdown, and code from almost any language (Python, R, and Julia are very well-supported but there's an open kernel spec), then this might be a more appropriate tool for you to use.

The Jupyter/IPython notebook default storage format is JSON, which makes it a little more friendly for text-based version control, and also enables a static HTML view of notebooks (http://nbviewer.jupyter.org/github/ketch/teaching-numerics-w...) on GitHub.
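The format really is just a small JSON document; a minimal sketch of the nbformat-4 layout (hand-written here, so field details are approximate):

```python
import json

# Cells are small dicts with plain-text "source", which is why notebooks
# diff reasonably in version control and can be rendered to static HTML
# without executing any code.
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["print('hello')\n"]},
    ],
}
text = json.dumps(nb, indent=1)
```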

Helen Shen wrote up a great article for Nature (http://www.nature.com/news/interactive-notebooks-sharing-the...) on how scientists are using the notebook, but it also provides a good overview of how you might use it, as well as a free interactive demo.

onalark commented on Julia Computing Granted $600k by Moore Foundation   moore.org/newsroom/in-the... · Posted by u/one-more-minute
sandGorgon · 10 years ago
Whenever R is compared to any other analytics framework, the number-one advantage cited is CRAN. And the counter is always "hey, Python has wheels".

Does anyone think that a dedicated package site for analytics libraries is what's needed?

From the R documentation:

In the simplest form, an R package is a directory containing: a DESCRIPTION file (describing the package), a NAMESPACE file (indicating which functions are available to users), an R/ directory containing R code in .R files, and a man/ directory containing documentation in .Rd files

But a Python wheel is considerably more complicated.

onalark · 10 years ago
Have you looked at conda and http://anaconda.org? We spent a lot of time curating the most important Python packages for data science into the Anaconda distribution, and conda packages are a great format for distributing complicated software.
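For comparison with the R package layout quoted above, a conda recipe is similarly small: a directory with a meta.yaml plus optional build scripts. A minimal sketch (the package name and URL are hypothetical placeholders):

```yaml
package:
  name: mylib          # hypothetical
  version: "0.1.0"

source:
  url: https://example.com/mylib-0.1.0.tar.gz

requirements:
  build:
    - python
    - setuptools
  run:
    - python
    - numpy

test:
  imports:
    - mylib
```

The same recipe format also handles C/C++/Fortran dependencies, which is where wheels historically struggled.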
onalark commented on Making thumbnails fast   engineering.khanacademy.o... · Posted by u/ka-engineering
aktiur · 11 years ago
Numba is indeed pretty impressive, but you're not comparing exactly the same thing with this code.

In the Numba case, you're basically modifying the image in place: it means no allocating a new array, no full copying. However, your pure-numpy code basically creates a new array (the result of np.dot) before copying it back entirely in image.

If you write the two functions so that they both return a new numpy array and do not touch the original one, the time difference drops from 4 times faster to 2.5 times faster. That's still an impressive difference, but at the loss of a bit of flexibility.

https://gist.github.com/aktiur/e1cddee8f699ded49824

N.B.: numpy.dot does not use broadcasting, i.e. it does not allocate a temporary array to extend the smaller one. The function handles n-dimensional arrays by summing over the last axis of the first array and the second-to-last axis of the second array.

onalark · 11 years ago
Thanks, I clearly wasn't being careful. I'll update my Gist...

edit: On reviewing, I think the intent of the original blog post was to modify images in place (or at least to filter them as quickly as possible, with in-place filtering acceptable). In that case, I think my comparison is fair, since NumPy doesn't offer a faster way to do the requested operation. I didn't try out einsum, but I think Numba would outperform that as well.
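For anyone following along, the copy-vs-in-place distinction looks like this (a sepia-style 3x3 per-pixel color transform with illustrative coefficients, not the post's exact filter; assumes NumPy):

```python
import numpy as np

# Illustrative per-pixel color transform matrix.
M = np.array([[0.393, 0.769, 0.189],
              [0.349, 0.686, 0.168],
              [0.272, 0.534, 0.131]])

def filter_copy(image):
    # Allocates and returns a new (h, w, 3) array; the input is untouched.
    return image.dot(M.T)

def filter_inplace(image):
    # Overwrites the caller's buffer, but np.dot still allocates a
    # temporary; only a hand-written loop (e.g. under Numba) avoids
    # that allocation as well.
    image[:] = image.dot(M.T)
```

A fair benchmark has to hold this choice fixed on both sides, which is exactly the point above.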

u/onalark

Karma: 135 · Cake day: May 9, 2013
About
http://aron.ahmadia.net