latchkey · 2 years ago
I built a whole remote software update mechanism for a control binary that ran on 25k+ servers across multiple data centers.

Rest assured that after the first time I messed it up (which required ssh into each box individually), I wrote a lot of unit and integration tests to make sure that it never failed to deploy again. One of the integration tests ensured that the app started up and could always go through the internal auto update process. This ran in CI and would fail the build if it didn't pass.

While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.

foobiekr · 2 years ago
Rivian is an embedded use case, though, which is not at all like a fleet of servers.

Having worked for companies that produce network devices - including devices that are unreachable for example for 6 months of the year - and on software installation and upgrade, I am baffled how this bricking is possible. For one thing, you generally use some kind of confirmed boot mechanism - you upgrade a standby partition, set an ephemeral boot value that causes the device to boot the alternate image, and reboot - only when the image is declared "up" does that get persisted (and then the alternate is upgraded, in order to prevent rollback in the event of a media error). You use watchdogs that are tied to actual forward progress (and not just some daemon that the kernel schedules and bangs on the watchdog even if the rest of the system is hung) and if they fail, the WD reboots you. (This is one of the reasons that event driven programming is somewhat preferred - actually processing events from a single dispatch thread makes it easier to reason about the system.)
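
As a rough illustration of the confirmed-boot flow described above, here is a minimal Python model. It is a sketch under stated assumptions, not real bootloader code: `BootEnv`, `stage_upgrade`, `select_slot` and `confirm` are hypothetical names, and on real hardware the one-shot value would live in NVRAM or a bootloader environment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootEnv:
    """Hypothetical model of the bootloader's view of two image slots."""
    persistent_slot: str = "A"       # slot that is known to be good
    try_slot: Optional[str] = None   # ephemeral, one-shot boot override

    def stage_upgrade(self) -> str:
        """Pretend the standby slot was written and verified, then arm a one-shot boot."""
        standby = "B" if self.persistent_slot == "A" else "A"
        self.try_slot = standby
        return standby

    def select_slot(self) -> str:
        """Run on every reset. The override is consumed before the image runs,
        so a hang followed by a watchdog reset lands back on the old slot."""
        if self.try_slot is not None:
            slot, self.try_slot = self.try_slot, None
            return slot
        return self.persistent_slot

    def confirm(self, running_slot: str) -> None:
        """Called only after the new image has declared itself 'up'."""
        self.persistent_slot = running_slot
```

With this model, an upgrade that never calls `confirm` boots the new slot exactly once and then falls back on the next reset, which is the property the comment is describing.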

On top of that, you make sure that the core system is an immutable filesystem so that you can validate the _offline_ alternate image before rebooting (write-and-read-back-uncached) and periodically scrub the alternate image (same).
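
A minimal sketch of the validate-and-scrub idea, assuming the alternate image is readable as a file and that a manifest supplies its expected SHA-256. The function names are hypothetical, and the uncached read-back mentioned above is not modeled here (it would need platform-specific I/O flags).

```python
import hashlib


def image_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the image in chunks so large partitions do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def image_is_intact(path: str, expected_sha256: str) -> bool:
    """Compare the on-media image against the digest recorded in the manifest."""
    return image_digest(path) == expected_sha256.lower()
```

The same check can be run once after writing the standby image and again on a timer as the periodic scrub.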

Like.. this is all embedded 101, stuff people have been widely doing since the mid 1990s and I think I can find examples going back to the 70s. Sometimes you get a little more sophisticated (allow sub-packages or overlays and use a manifest to check the ensemble instead of just a single image), but it's very standard.

dcow · 2 years ago
Assuming Rivian does know embedded 101, my guess is that the infotainment system is running Android and the watchdog reported all green once the system services all came online and that it doesn't actually check whether the application layer is really working because, as you know, that would require the watchdog to run a full regression suite before giving the okay, which isn’t practical. Since the update swapped the system to an internal dev cert, they can't push an immediate update to change the boot args because the management plane daemon won’t connect to the C&C server, or it can but the blob they push wouldn’t pass signature validation, or the TEE won’t unlock the device keys because the roots changed. Whatever the case, someone has to go blow a fuse and re-flash the thing, or at least rewrite the boot args via serial. Just a guess.

If it is the most likely “management plane TLS certs” issue, I bet the watchdog won’t confirm the new boot args until the command dispatch daemon gets a pong from the C&C server moving forward (:
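
The idea of gating boot confirmation on a round trip to the management server could look something like the sketch below. This is purely illustrative: the check names are hypothetical, and nothing here reflects how Rivian's system actually works.

```python
from typing import Callable, Dict, List


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Return the names of health checks that returned False or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failed.append(name)
    return failed


def should_confirm_boot(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Persist the new image only if every check passed, including the
    (hypothetical) 'management_pong' check that proves updates are still reachable."""
    return not failed_checks(checks)
```

A device that boots fine but cannot reach its update server would then fail `should_confirm_boot` and roll back, instead of being stranded on a build that can never be fixed remotely.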

ikiris · 2 years ago
That sounds out of scope for the MVP. We can worry about redundancies later after we ship.
KingMachiavelli · 2 years ago
Did you just use standard Yocto or similar tools to build such images? Are there standard daemons for managing hardware watchdogs (besides systemd since that's too simple as you say)? I think there's a lot of niche knowledge in the embedded space and many programmers are used to cloud systems and at most target. The most embedded experience most programmers have is likely iOS/Android development where all of the actual embedded concerns are handled for you. Even Google (soft)bricked a bunch of phones with the latest Android 14 update [1].

IMO there's not a lot of regular OSS for building embedded systems that comes with A/B partitioning, watchdogs, secure and verified boot - it's all custom at every org and tailored for individual products.

[1] https://arstechnica.com/gadgets/2023/11/android-14-patches-r...

neuralRiot · 2 years ago
> including devices that are unreachable for example for 6 months of the year

That made me think, imagine NASA bricking Voyager with a SW update.

aaronbeekay · 2 years ago
As somebody currently working at an automaker on software systems, the amazing thing to me is that a mess up of this level doesn’t happen weekly. It’s rough out here.
jacquesm · 2 years ago
Thank you. At least you're honest about it, the other day someone was trying real hard to convince me that software developers at automakers are made of magic fairy dust.
bozhark · 2 years ago
What's the priority then, telemetry data? Why is it rough out there?
foobiekr · 2 years ago
do you guys not have confirmed boot and swizzling to fallback images?
cjbprime · 2 years ago
> This ran in CI and would fail the build if it didn't pass.

I don't mean to be pedantic, but since we're talking about what should happen instead, this is insufficient. It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.

People should do what you described in CI, but as well as that, you need phased rollout, where e.g. the build can only be rolled out to the next percentage point of randomly selected users in a specific segment (e.g. each hardware revision and country as independent segments) after meeting a ratio of successful check-ins, in the field, from the new build by production customers in that segment. That's the actual metric for proceeding with the rollout: actual customers are successfully checking in from the new version of the software.
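
A minimal sketch of the check-in gate described above, applied per segment. The thresholds (`min_ratio`, `min_sample`) and the names are assumptions chosen for illustration, not values from any real rollout system.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Field telemetry for one segment, e.g. (hardware revision, country)."""
    offered: int      # devices in the segment that were offered the new build
    checked_in: int   # of those, devices that have checked in from the new build


def next_rollout_percent(current_percent: int, stats: SegmentStats,
                         min_ratio: float = 0.99, min_sample: int = 100) -> int:
    """Advance by one percentage point only when enough devices have
    successfully checked in from the new build; otherwise hold."""
    if stats.offered < min_sample:
        return current_percent   # not enough evidence yet
    if stats.checked_in / stats.offered < min_ratio:
        return current_percent   # hold; a real system would likely also alert
    return min(current_percent + 1, 100)
```

For example, `next_rollout_percent(5, SegmentStats(offered=1000, checked_in=900))` holds the segment at 5 percent because only 90% of offered devices reported back.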

Except, that's actually not sufficient either. What if the new build is good, but it contains an update to the updater which bricks the updater? Now you're getting successful check-ins from the new version in the field, but none of those customers will ever successfully auto-update again. So, test the new updater's ability to go forwards successfully, too.

quailfarmer · 2 years ago
A good way to handle the who-updates-the-updater issue is to use a triple partition updater. A updates B, and then B updates C, then C updates A. If anything about the new version prevents it from properly updating its neighbor, that neighbor won't be able to close the loop, and you'll fall back to A. This simplifies the FSBL, because it just boots the three partitions in a loop, no failure detection required. You don't need to triplicate the full application either, just the minimum system needed to perform an update, and then have the "application" in its own partition to be called by the updater.
latchkey · 2 years ago
> It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.

Nah, my CI process was solid. This was proven in the field over the course of years.

> I don't mean to be pedantic... you need phased rollout

You don't need to be pedantic, but better to ask the question rather than assume that was all that I did. =) You have to realize that what I built, worked flawlessly. It wasn't easy either, took a lot of trial and error.

I did have a CIDR based rollout. I could specify down to the individual box that it would run a specific version. Or I could write "latest" to always keep certain boxes running on the latest build. This was another part of my testing, but ended up not being fully necessary because I had enough automated testing in CI that "latest" always worked.
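
One way a CIDR-based version map like this could be resolved is a longest-prefix match over a rule list. The sketch below is an assumption about the general shape, not the actual implementation; the rules and version strings are made up.

```python
import ipaddress
from typing import List, Tuple

# Hypothetical rules: (CIDR, pinned version or "latest").
RULES: List[Tuple[str, str]] = [
    ("10.1.2.3/32", "1.4.2"),   # pin one individual box
    ("10.1.0.0/16", "latest"),  # these boxes always track the newest build
    ("0.0.0.0/0", "1.4.0"),     # default for everything else
]


def target_version(ip: str, latest: str, rules: List[Tuple[str, str]] = RULES) -> str:
    """Return the version a box should run, using the most specific matching rule."""
    addr = ipaddress.ip_address(ip)
    matches = []
    for cidr, version in rules:
        net = ipaddress.ip_network(cidr)
        if addr in net:
            matches.append((net.prefixlen, version))
    if not matches:
        raise LookupError(f"no rollout rule matches {ip}")
    _, version = max(matches)  # longest prefix wins
    return latest if version == "latest" else version
```

Each box would ask for its target on every "check for update" tick, which is what makes the fleet eventually consistent.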

> but it contains an update to the updater which bricks the updater?

This happened, so I wrote a lot of test code to make sure that would never happen again. My CI would catch that since I was E2E testing that it could actually run the upgrade process.

Once I implemented all of this, I never had a single failure and would routinely, several times a day, deploy to the entire cluster, over the course of a couple years.

It was all eventually consistent as I could also control the "check for update" frequency as well.

jacquesm · 2 years ago
And you need to verify the vehicle is not in motion.
psychlops · 2 years ago
Having worked on 25K machines, I can assure you that it never deployed to every single machine and failed to do so in interesting ways all the time.
latchkey · 2 years ago
It always deployed. It was eventually consistent. Any failure would automatically be resolved after a period of time.
postalrat · 2 years ago
As a frontend web developer I'm constantly deploying software to many thousands of machines. And you know what? It's pretty damn simple.
donmcronald · 2 years ago
> While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.

I feel like it's going to happen to someone that makes network devices eventually. I'm always scared to update my (several hundred) UniFi devices. Their update process isn't foolproof and they push auto-updates via the UI pretty hard.

Several years ago they caused some people's devices to disconnect from the management controller when they enabled 'https' communication. Prior to that, if you were pointing devices at 'https://example.com:8080...' they would ignore the 'https' part and do an 'http' request to port '8080'. Then they pushed their 'https' update which expected an 'https' connection and didn't fall back to the old behavior for anyone that was mistakenly using 'https' in their URL initially. Some people on their forums complained about having to manually SSH to every device to fix the issue.

It was caused by an end-user mistake, but they knew it was a potential issue. AFAIK, their attitude on it hasn't changed a lot, and at the time their response was that they knew it would break some people, but that it wouldn't be that many (lol).

IMO, the issue with those systems is that basic communication back to the update / config server is part of the total package which is too complex (ie: a full Debian install). I'd rather see something like Mender (mender.io) where the core communications / updates come from a hardened system with watchdog, recovery, rollback logic.

Think of how crazy it is to have something like pfSense doing package based updates rather than slice based updates. At least with boot environments they could add some watchdog and rollback type logic, but it'll still be part of the total system instead of something like a hardened slice based setup where the most critical logic is isolated from everything else and treated like a princess.

Do you have any insight on package vs slice based systems for updates? Did you isolate update logic from the rest of the system or am I out of touch with that opinion?

vGPU · 2 years ago
Reminds me of my (far less critical) update process for home assistant. Every time something breaks. Currently my hvac automations are going haywire.
akira2501 · 2 years ago
When possible, I used a fail back mechanism. If the update failed to fully come up, then the watchdog timer would catch it, the bootloader would notice the incomplete boot, and attempt to boot from the previous known working image in that case.
code_runner · 2 years ago
out of morbid curiosity.... how long did it take to ssh into and fix all of those servers? I imagine even automating a fix (if possible) would still take a good amount of time.
latchkey · 2 years ago
GNU parallel and sshpass are your friends.

The way I built my app was that I could install it cleanly via a curl | bash.

So, I just had a simple shell script that iterated through the list of IP addresses (from the DHCP leases), ran curl | bash and that cleaned up the mess pretty quickly.

jdechko · 2 years ago
As a non-developer, the whole situation with a bad software update to the Voyager spacecraft really puts things into perspective as far as how bad remote updates can be.

It’s also a testament to the way that the system was designed that they were able to get it back online.

sixtram · 2 years ago
you ssh-d into 25K servers one by one? I mean, manually?
ugh123 · 2 years ago
Please tell me you scripted that ssh across your 25k servers!
latchkey · 2 years ago
https://news.ycombinator.com/item?id=38270986

One thing my little control process did on the box was to always set the password to be the same... user/1.

None of these boxes needed inbound connections, so it wasn't a big deal to do that.

gravitronic · 2 years ago
I used to work for a company that built satellite receivers that would be installed in all sorts of weird remote environments in order to pull radio or tv from satellite and rebroadcast locally.

If we pushed a broken update it might mean someone from the radio company would have to make a trip to go pull the device and send it to us physically.

Our upgrader did not run as root, but one time we had to move a file as root.. so I had to figure out a way to exploit our machine reliably from a local user, gain root, and move the file out of the way. We'd then deploy this over the satellite head end and N remote units would receive and run the upgrade autonomously. Fun stuff.

Turns out we had a separate process running that listened on a local socket and would run any command it received as root. Nobody remembered building or releasing it but it made my work quick.

singleshot_ · 2 years ago
The person who built and released this might not have ever worked for your company, which might be why no one remembers building or releasing it.
gravitronic · 2 years ago
No no, I figured that out afterwards, in a past development iteration someone added it on purpose and then forgot all about it - "oh yeah we needed that to <solve some mundane problem>".

So... worse than subterfuge? That being said it only listened on the local socket, so it's slightly less bad, and I don't want to get into the myriad of correct ways that original problem could have been solved, but let's just say that company doesn't exist anymore.

cjbprime · 2 years ago
I admire your restraint in writing this comment. :)
ThePowerOfFuet · 2 years ago
This is one of the very finest comments I have ever seen on HN (or anywhere else, for that matter).
nomel · 2 years ago
> Turns out we had a separate process running that listened on a local socket and would run any command it received as root. Nobody remembered building or releasing it but it made my work quick.

No offense, but what a shit show. It makes me assume no source control, and a really good chance that state actors made their way into your network/product. This almost happened at a communication startup I know, with three letter agencies helping resolve it. State actors really like infiltrating communication stuffs.

gravitronic · 2 years ago
oh, yeah, this place was a total shit show. BUT we were ISO9001 certified!! So we had source control (CVS) and a Process (with a capital P) to follow. In this case that code was added in a previous development iteration because someone needed to run something as root when a user pressed a certain button on the LCD panel in front and this was the decoupled solution they wrote intentionally. Somehow I feel like that makes it worse than if it was a malicious three letter agency lol.
qmarchi · 2 years ago
It's crazy to me that this is possible in the first place. Standard practice is to have a fleet of test vehicles that are effectively production except in an early release group.

Or, you know, having an A/B boot partition scheme with a watchdog. Things that have been around for decades at this point.

Disclaimer: Former Googler, Worked closely with Automotive.

michaelt · 2 years ago
To me it's all-too-understandable how this is possible.

Maybe they've got a test fleet, but it accepts code signed with the test build key.

Maybe they've got a watchdog timer, but it doesn't get configured until later in the boot process.

Maybe they've got A/B boot partitions, but trouble counting their boot attempts - maybe they don't have any writable storage that early in the boot process.

I wouldn't be surprised if, as a newer company, they'd made a 'Minimum Viable Product' secure boot setup & release procedure, and the auto-fallback and fat-finger-protection were waiting to get to the top of the backlog.

qmarchi · 2 years ago
So, using Polestar as a reference as it's both a vehicle that I've worked on, and one that I personally drive.

> Maybe they've got a test fleet, but it accepts code signed with the test build key.

Polestar solves this by only delivering signed updates to their vehicles. The vehicle headunit will refuse to flash a partition that isn't signed by the private key held by Polestar. Pulls double duty to prevent someone from flashing a malicious update, as well as corruption detection.

> Maybe they've got a watchdog timer, but it doesn't get configured until later in the boot process.

Based on what the Rivian reports are showing (Speedometer, cameras, safety systems are working), they likely are running their infotainment as a "virtual machine" within their systems. Again, something that Polestar does.

Implementation of a watchdog with a "sub-system" like this is relatively braindead simple.

> Maybe they've got A/B boot partitions, but trouble counting their boot attempts - maybe they don't have any writable storage that early in the boot process.

Generally, A/B partitioning is part of the bootloader, the first program that executes after the reset pin is released (on many modern processors). This also leads to reboot counters and such being stored as part of the NVRAM that is available at boot.

Opinion: Maybe I'm biased, but maybe if you can't develop something yourself, there's reason for you to get an off the shelf option that handles a lot of these things.

Disclaimer: Former Googler, Worked closely with Automotive.

LoganDark · 2 years ago
> Maybe they've got A/B boot partitions, but trouble counting their boot attempts - maybe they don't have any writable storage that early in the boot process.

You do not report a successful boot until and unless the entire system loads up successfully. You will definitely have writable storage by then.

psunavy03 · 2 years ago
Exhibit A of why a Minimum Viable Product still needs a proper Definition of Done which includes quality standards.
worik · 2 years ago
What amazes me is that any grown up person thinks it is a good idea to update vehicles as if they were telephones

Owners should have to bring the vehicle into a shop to have changes made, and they should be very rare.

This is lazy control freakery of the worst kind

Something very bad is going to happen and people will die before we realize that it is a stupid dangerous practice

qmarchi · 2 years ago
I understand the sentiment, but think about the alternatives.

There are a few different kinds of updates that can be applied, each with their own protective layers.

Infotainment updates, like what happened to Rivian, aren't that dangerous. You lose "convenience features" like maps, air con, etc, but generally nothing that could kill you or someone else.

Then there's system updates, which is where danger noodle things happen. Automotive manufacturers are significantly more risk averse to updating these components, and generally, if _anything_ within the system looks wonky, it's an immediate revert.

If I, as a Polestar owner, wanted to get an update for my vehicle, the nearest service center is 1.5h away. If I lived in Montana (United States), it would be realistically impossible for me to update my car. Thus, if we want to enable competition within the markets, we shouldn't have regulations that force a new manufacturer to have a global network just to add CarPlay to a screen.

LastMuel · 2 years ago
On the other hand, we update irreplaceable spacecraft billions of miles away with new software.

It should be fine to push software updates out, as long as the correct safety and fallback procedures are in place. It simply has to be designed to handle failure and procedures need to be in place to mitigate risks.

It sounds like that wasn't the case here. Also, why wouldn't you have a small initial release pool when you have such a large potential for disruption?

bradleyjg · 2 years ago
The art of shipping software—like on a disk, where once it’s out the door, it’s out the door and you may never get another shot—is dead or dying. Even in some embedded areas of the industry now.
fargle · 2 years ago
> What amazes me is that any grown up person thinks it is a good idea to update vehicles as if they were telephones

What amazes me is that any grown up person thinks it is a good idea to update telephones as if they were software and not phones.

Or rather that it is a good idea to have phones that need updates? Either way, we're all one 1/2 assed push update to a fridge, vacuum, washing machine, phone or car away from a really annoying day.

vore · 2 years ago
As the update only affects infotainment and not critical systems, it seems like a reasonable tradeoff to me. Just because a car can fail in ways that kill people doesn't mean all parts of a car are equally critical.
spaceywilly · 2 years ago
Yeah... I worked on an embedded project with literally 2 engineers, and we had an A/B partitioning scheme, and a recovery partition (we fully qualified the recovery image and it was flashed to the units on day 1, it was guaranteed to boot and it would just sit and wait for the user to initiate a firmware load). The app on the device would reset a U-boot variable once it was successfully loaded, so U-boot could check the number of failed boot attempts. If it was >= 5 reboot attempts without booting successfully, it would go into the recovery partition.
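
The boot-attempt counter described above can be modeled in a few lines. This is a Python sketch of the logic only, with a plain dict standing in for the bootloader environment; the key names are hypothetical, and the limit of 5 comes from the comment.

```python
BOOT_LIMIT = 5  # from the comment: 5 unsuccessful attempts, then recovery


def bootloader_choose(env: dict) -> str:
    """Runs on every reset: pick recovery if too many attempts have failed,
    otherwise count this attempt and boot the active slot."""
    attempts = env.get("bootcount", 0)
    if attempts >= BOOT_LIMIT:
        return "recovery"
    env["bootcount"] = attempts + 1
    return env.get("active_slot", "A")


def app_mark_boot_ok(env: dict) -> None:
    """Called by the application once it has fully loaded."""
    env["bootcount"] = 0
```

If the application never reaches `app_mark_boot_ok`, the counter keeps climbing across resets and the sixth reset lands in the recovery image.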

There's really no excuse from Rivian on this, this is shoddy

LargeTomato · 2 years ago
I interviewed at Rivian. They told me about how they needed to grant users access to things like keys, AC, ignition, etc. So they built a hierarchical, recursive group checking IAM system.

That just felt like a massive product to build and maintain for what really could have been backed by AWS IAM. GCP IAM if they really really needed hierarchy. I guess I'm not surprised at this outage.

DannyBee · 2 years ago
Rivian does have a test fleet, and they test it for weeks before releasing. This particular issue is because they apparently distributed the firmware signed with the wrong cert.

Not a bug in the software itself.

That is independent of testing the software, but still a distribution issue.

mytailorisrich · 2 years ago
My 2c based on your comment:

* "signed with the wrong cert" should mean the software package is rejected before it is installed.

* software upgrades are tricky and there should be at least 2 versions available so that fallback to the previous is possible and automatic in case of issues.

jandrese · 2 years ago
Yeah, but how did the vehicle not just reject the wrong cert and refuse to flash the update?
mlyle · 2 years ago
The code went through early release tests successfully; the problem came with how it was more broadly released.

They should have had further staging of the rollout (randomizing when it is offered to users).

whalesalad · 2 years ago
A/B partitions tend to solve that. You will only switch to the new partition when the update is 100% verified installed. If it doesn't complete in an atomic manner, your device will just boot into the previous healthy partition.
MichaelZuo · 2 years ago
The 'early release tests' weren't testing an identical copy of the actual update?
hef19898 · 2 years ago
I am still not sure why I would update software on a car, a piece of hardware that, IMHO, should be able to run air gapped 24/7. Exceptions: recurring bugs, GPS maps and security updates. All of which can be done either during service (preferred, if they brick it, they are liable) or by plugging in something. OTA updates just seem completely pointless...

Edit: Also, why the heck isn't the entertainment system completely air gapped from the software running the car?

refulgentis · 2 years ago
Rollouts don't solve problems, they limit who they affect.
xyst · 2 years ago
When a car company is losing money on every car sale, C-level execs are going to cut corners.
dewski · 2 years ago
This is a bad take.
cs702 · 2 years ago
It's easy to underestimate how hard and expensive it is to build, deploy, and remotely upgrade software that runs reliably on a fleet of diverse cars (different models, different years, slightly different components from batch to batch, etc.). It makes updating a mobile phone OS look trivial in comparison.

So far, only Tesla seems to be able to update car software remotely, regularly and reliably. I'm certain it's neither easy nor cheap.

All things considered, physical buttons and dials are probably easier and cheaper, because they don't require software updates!

VyseofArcadia · 2 years ago
Forget updates entirely. My car is one of the few places I expect to get software that works the first time.

If you absolutely must have updates, then at least not OTA updates. Have them done at the dealership or service center so any issues can be dealt with immediately.

Come on, is this engineering or hacking? This is a car, not a CRUD app. Get. It. Right.

dagmx · 2 years ago
That’s how things used to be and it resulted in lots of long standing bugs because the update rates were low, and so manufacturers didn’t push updates. Many people don’t live near dealers or service centers, or can’t afford the continued cost (it’s usually not free unless it’s a recall).

OTA is better for consumers when done properly. Other manufacturers manage it fine, and one bad example shouldn’t be what we base things on. It’s what we should learn from and improve on.

dalyons · 2 years ago
eh i guess i disagree. We had that (& still do for some cars) for decades, and it universally resulted in terrible software that you were stuck with for the life of the car. Hard to update == hard to iterate == bad software.
w0m · 2 years ago
random new features via OTA updates was one of the deciding factors when i bought my car ... :)

I also mostly WFH so... yea. lol.

matrss · 2 years ago
> All things considered, physical buttons and dials are probably easier and cheaper, because they don't require software updates!

I am pretty sure there is a market for a dumb modern car, but no one is building it. I am thinking of an electric car without anything "smart" in it. Modern safety features can stay, if they work completely self contained and without requiring an external connection ever over the lifespan of the car.

iso8859-1 · 2 years ago
I wonder if it is somehow possible to use an open source battery management system to build a car like this. See https://foxbms.org/
jacquesm · 2 years ago
Regulatory pressure may well get you to do stuff you wouldn't want to do.
NotYourLawyer · 2 years ago
I’d buy that car today.
wannacboatmovie · 2 years ago
This isn't a bunch of Windows PCs home-built from a hodgepodge of components.

They designed, built, and shipped all the hardware. There is ABSOLUTELY NO excuse for not having a database of the exact hardware configs by serial number. They have the ability to test every single shipped configuration.

If they don't, they have already failed as a car company.

AlotOfReading · 2 years ago
I guarantee they have a database with the hardware configs. It's required by NHTSA to do recalls and notices. They'll undoubtedly be using that to inform the right people to come in.

The update servers almost certainly don't talk to that system though.

wil421 · 2 years ago
> So far, only Tesla seems to be able to update car software remotely, regularly and reliably. I'm certain it's neither easy nor cheap.

My Jeep Grand Cherokee has had OTA for 5+ years. BMW has been doing it since 2018.

I’m almost positive a family member had it with GMC OnStar back in the late 2000s.

willio58 · 2 years ago
I don't think the Jeep or BMW infotainment systems are nearly as fleshed out or complex as Rivian's, especially not Tesla's. Maybe I'm wrong!
bri3d · 2 years ago
> All things considered, physical buttons and dials are probably easier and cheaper, because they don't require software updates!

Almost all automotive control modules have firmware, whether that firmware is parsing touchscreen inputs or a rotary encoder.

NotYourLawyer · 2 years ago
Well sure, but the rotary encoder can’t get moved to a different menu tree by a software update, and I can use it without taking my eyes off the road. I know which I prefer.
xyst · 2 years ago
Tech junk shouldn’t go in cars, period. Cars shouldn’t be as pervasive and prevalent in society (at least in USA). Yet here we are. Car manufacturers have spent an insane amount of money over decades to get to this point (buying legislators, forcing highway infra, subsidies, profit driven strategy over sustainability)
treesknees · 2 years ago
Decades? Try almost a century. For better or worse, our cities and various economies were built around the automobile.

It's still a free market - these companies could choose not to put tech into their product. But look at the backlash against GM when they announced they wouldn't support Apple Car Play or Android Auto. Consumers want it.

FireBeyond · 2 years ago
> So far, only Tesla seems to be able to update car software remotely, regularly and reliably. I'm certain it's neither easy nor cheap.

Tesla, whose computer systems quite regularly need to be hard rebooted while the car is driving? That Tesla?

code_runner · 2 years ago
I had to do this once or twice (it's very very infrequent in my experience) and one time it was genuinely terrifying, as I had lost blinkers etc where a few interstates all intersect and merge etc.

I still do love the car though.... but a very sketchy moment that I shouldn't have brought on myself while driving in that situation.

Aurornis · 2 years ago
> (different models, different years, slightly different components from batch to batch, etc.). It makes updating a mobile phone OS look trivial in comparison.

Not really. Vehicle computers aren’t vastly different on every model year and every trim level or option package. These parts are standardized, tested, and carried across model years.

Even with changes, the teams would be expected to have the different variants in their development and test cycles. The 2020, 2021, and 2022 model infotainment systems likely share a lot more in common than an iPhone 13, iPhone 14, and iPhone 15 with all of the non-Pro, Pro, and Max variants.

etchalon · 2 years ago
My Volvo XC90 gets regular OTA updates without issue, and so did my Land Rover Discovery before it.
tech_ken · 2 years ago
> All things considered, physical buttons and dials are probably easier and cheaper, because they don't require software updates!

If it ain't broke it's ripe for disruption

dylan604 · 2 years ago
if (cpu == A) do code

else if (cpu == B) do other code

They invited the multiple combination vampire into their house. They know what devices are being used. If you don't want a dedicated update per piece of equipment, it'll be a large binary with lots of branching. Saying they don't know what device is where is just lazy. Ask the device what it is, and have a branch for it. If the device IDs itself as something unknown, don't do anything.
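
The "ask the device what it is, and have a branch for it" approach could be written as a dispatch table rather than a chain of `if`/`else`. A minimal sketch with made-up device IDs and handler names:

```python
from typing import Callable, Dict, Optional


def update_cpu_a() -> str:
    return "applied update for cpu A"


def update_cpu_b() -> str:
    return "applied update for cpu B"


# Hypothetical mapping from the ID a device reports to its update routine.
HANDLERS: Dict[str, Callable[[], str]] = {
    "A": update_cpu_a,
    "B": update_cpu_b,
}


def apply_update(device_id: str) -> Optional[str]:
    """Run the routine for a known device; do nothing for an unknown one."""
    handler = HANDLERS.get(device_id)
    if handler is None:
        return None  # unknown hardware: refuse to touch it
    return handler()
```

Returning `None` for an unrecognized ID is the "don't do anything" branch from the comment.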

xgbi · 2 years ago
From what I read somewhere, Tesla was able to do that because they have remote ssh capability.

In at least one instance, they fixed the cars manually by running a massive remote command on all cars after a messed up update: https://lobste.rs/s/v42zil/former_tesla_employee_ssh_d_as_ma...

I wouldn’t call that very reliable, but they do indeed do it regularly.

FireBeyond · 2 years ago
And it's not like they'd ever abuse that ability, like when someone pokes around in their car and discovers references to a new unannounced model, and then Tesla reaches in, force downgrades the vehicle to older software with no references, and then disables the ethernet port on the vehicle, and for a final fuck you disables its ability to ever get another update.

They'd never do that, except when they did do that.

SoftTalker · 2 years ago
It sounds like, in this case, the updates clobbered the ssh authorized keys (or equivalent in their system) and so now they cannot access the cars remotely. So they are going to have to go into the shop and have the authorized keys restored.
scardycat · 2 years ago
Bringing CI/CD mindset to cars is probably not a great idea. Software updates to commuter vehicles should have a high bar for operational standards, and a simple thing such as an expired certificate should have never been deployed. Having isolated networks in vehicles helps but doesn't prevent broken updates from, eventually, bricking the cars.
nomel · 2 years ago
I think this shows more of a fundamental flaw in their update mechanism, than anything.

I don't think a botched update is a big deal. It happens, and should be expected, in a sane design. The fact that the customer noticed is a big deal.

There are many implementations that could be used for an "auto rollback" feature. They either failed to implement that in a sane way, or they were goobers, and assumed things would always be rosy.
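
One possible shape for an "auto rollback", sketched in Python under assumptions: two slots named "A" and "B", a small persistent boot state, and a retry counter. The names `BootState`, `stage_update`, `select_slot`, and `confirm_boot` are hypothetical, and this is not a claim about how Rivian or anyone else does it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    confirmed: str = "A"          # slot known to boot and work
    trial: Optional[str] = None   # freshly updated slot awaiting confirmation
    tries_left: int = 0           # boots remaining before giving up on trial


def stage_update(state: BootState, max_tries: int = 3) -> None:
    """Mark the other slot as a trial after writing the new image to it."""
    state.trial = "B" if state.confirmed == "A" else "A"
    state.tries_left = max_tries


def select_slot(state: BootState) -> str:
    """Bootloader side: pick the slot to boot, spending one try on a trial."""
    if state.trial is not None and state.tries_left > 0:
        state.tries_left -= 1
        return state.trial
    # No trial pending, or it never confirmed itself: roll back.
    state.trial = None
    return state.confirmed


def confirm_boot(state: BootState, running_slot: str) -> None:
    """New image side: call only after health checks pass."""
    if running_slot == state.trial:
        state.confirmed = running_slot
        state.trial = None
        state.tries_left = 0
```

If the new image never calls `confirm_boot`, the tries run out and the next boot lands on the old slot, so the customer never notices the botched update.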

babypuncher · 2 years ago
I would be pretty pissed if I went out to my garage to head to work one morning and found that a damn software update bricked my car overnight. This shouldn't even be a thing. Why does a car need regular software updates to keep functioning?
gitfan86 · 2 years ago
The Tesla update is probably slow for this reason. It is probably verifying that it can roll back at any point of failure.
1234letshaveatw · 2 years ago
From a few days back: its software has been a “key differentiator” https://electrek.co/2023/11/10/rivian-using-software-to-scal... Kind of humorous in hindsight.
wannacboatmovie · 2 years ago
Interesting to note that Ford's approach to updating software is far more conservative and car-like. It can be done fully offline via USB, but the instructions include, as a necessary step, a request that you upload the log files written to the memory stick back to them when complete. Presumably this is so they can track and stop incidents like this before they happen fleet-wide.

Rivian seems more like a "ship it and we'll fix it in the next sprint!" company.

How do other manufacturers handle updates?

post_break · 2 years ago
Ford's approach is flawed, however. You can still update SYNC with a bad update and bork it over USB. Ask me how I know.
r00fus · 2 years ago
Pray tell, how painful was your discovery?
sturza · 2 years ago
A/B partitions
barryrandall · 2 years ago
The last time I built something like that, it used 1 partition for the current version, 1 for the last version, 1 with the as-shipped version, and 1 that could restore A or B from the internet or USB.
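
Roughly, the boot-time choice for a layout like that could look like the Python sketch below. The partition names, the SHA-256 check, and the fallback order are assumptions for illustration, not the actual design.

```python
import hashlib
from typing import Dict

# Assumed fallback order: newest first, as-shipped image last.
BOOT_ORDER = ["current", "previous", "factory"]


def is_valid(image: bytes, expected_sha256: str) -> bool:
    """Check an image against its recorded SHA-256 digest."""
    return hashlib.sha256(image).hexdigest() == expected_sha256


def pick_boot_partition(images: Dict[str, bytes],
                        expected: Dict[str, str]) -> str:
    """Return the first partition whose image verifies, else 'recovery'."""
    for name in BOOT_ORDER:
        if name in images and name in expected:
            if is_valid(images[name], expected[name]):
                return name
    # Nothing verified: boot the restore partition (internet or USB).
    return "recovery"
```

The point of the ordering is that a bad update only costs you one step down the chain, and the restore partition is the last resort rather than the first.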
reneberlin · 2 years ago
When will humans be crazy enough to update the firmware of artificial hearts OTA?

Updating cars with new features OTA, even "just" the infotainment, can cost lives, because the driver might get confused and take their eyes off the road.

It should be forbidden and every change should be made clear to the driver, shown in detail, and should need verification twice before being accepted. There must not be any kind of surprise in a car for the driver.

It should even be possible to skip an update or stop updating at all.

rekoil · 2 years ago
Not updating cars OTA (yes, even "just" the infotainment) can potentially cost lives as well, as security holes would not get patched until the next service appointment.
qudat · 2 years ago
What a nightmare. This is where software engineering meets "real" engineering, where a "bug" has potentially life threatening consequences.
nomel · 2 years ago
> where a "bug" has potentially life threatening consequences.

What are you referring to? That is not relevant to this story, and would require a deep understanding of the system to make such a claim of negligence.

“The issue impacts the infotainment system. In most cases, the rest of the vehicle systems are still operational ...”

Also, you can't do an update while driving.

jawns · 2 years ago
Based on the photo included in the article, what they're calling an infotainment system is actually two separate components, one of which appears to be taking the place of a traditional dashboard. If that's the case and there's no other way to monitor speed, fuel levels, engine temperature, warning lights, etc., I'd say that's quite a bit more worrisome than just not being able to play your favorite music while driving.
ct0 · 2 years ago
You've never been to Death Valley without air conditioning or Russia without heat. I think the infotainment system in this case has a broken climate control function. There are workarounds, but what if you don't have your phone?
qudat · 2 years ago
> What are you referring to?

Not the specifics of this article, but more generally about the gravity of the situation car makers (and their software engineers) operate under. An OTA software update that causes a bug within more critical features of a car could be life threatening. So my point isn't about the specifics of this particular bug, but rather the capacity for a bug that could kill.

nunez · 2 years ago
Critical safety systems/functions appear to be unaffected by this outage.