Tackling Technical Debt and Improving Security: How Alto Untangled an Overloaded User Model

Nov 15, 2022

By Alex Toombs, Senior Security Engineer

At Alto, we have three types of users: customers, healthcare providers (nurses, doctors, etc.), and operations users who run the pharmacy. For each type of user, we maintain a separate application (though they share a monolithic backend for now): our customer app; our Alto Connect app for providers; and our internal pharmacy management tool called Wunderbar, which powers the pharmacy experience from prescription intake to fulfillment.

Early on, we made the pragmatic choice to represent both customers and operations users with a single User model, distinguishing them with a “roles” attribute. This required repetitive checks that, over time, slowed down our engineering teams, presented substantial privacy risk, and posed difficult scaling challenges as we evolved our monolith into different microservices. 
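
To make the pain concrete, here is a minimal, purely hypothetical sketch (plain Ruby, not our actual schema) of the kind of branching a shared User model forces into every feature:

# Hypothetical sketch: one model, two very different populations of users.
class User
  attr_reader :roles

  def initialize(roles:)
    @roles = roles
  end

  # Every call site has to remember to ask which population it's dealing with.
  def ops_user?
    (roles & ["pharmacist", "care_team"]).any?
  end
end

user = User.new(roles: ["pharmacist"])
if user.ops_user?
  puts "apply workforce-only rules"
else
  puts "apply customer-only rules"
end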

Last year, we set out to tackle this technical debt by splitting the User model into separate concerns and refactoring the entire system into this new paradigm.

The challenges of relying on a single User model for different types of users

The notion of a User is core to our concept of identity, from which authentication and authorization proceed. From a security perspective, it’s difficult to build an extensible and clean authentication and authorization system when you’re overloading one model for two very different populations of users. From a privacy perspective, this muddied customer records (which are subject to more restrictions because they’re healthcare records) with workforce records, hamstringing many analytics efforts.

  • Authentication: Alto employees who were also Alto customers were required to use their work emails for their customer accounts. Not only was this restriction frustrating for Alto employees, but it also posed a security risk: employees should only be able to access Wunderbar via corporate SSO, not their personal account passwords. 

  • Authorization: Engineering new authorization capabilities required intricate knowledge of the idiosyncrasies of our user structure. This both slowed down development and created tremendous risk that any mistake could easily escalate user permissions.

  • Privacy: Because they existed in the same table, both customer records and Alto operations user records were classified as Protected Health Information, or PHI, which severely limited what we were able to do with them.

  • Attribution: Our robust audit logs were significantly less useful when a user ID in the transaction log might refer to either an ops user or a customer.

  • State management: Some updates pertaining to admin identity lived on the customer model, some lived on the employee model, and others lived in various other models.

  • Architecture: Tight coupling would have prevented us from decoupling our monolithic system into smaller services.

What were we working with?

Our backend is primarily a monorepo Rails application that shares a tremendous amount of functionality between our customer, provider, and ops user-specific domains. Generally, we use JWTs to secure customer and ops user sessions. These JWTs are issued only once valid credentials have been presented.
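
For context, here is a minimal sketch of what JWT-backed sessions look like with the `jwt` gem; the claim names, expiry, and secret handling below are illustrative rather than our production configuration:

require "jwt"

# Illustrative signing secret; real key management is more involved.
SIGNING_SECRET = ENV.fetch("JWT_SIGNING_SECRET")

# Issue a short-lived session token once credentials have been verified.
def issue_session_token(user_id)
  payload = {
    sub: user_id,
    iat: Time.now.to_i,
    exp: Time.now.to_i + 3600 # one-hour expiry, purely illustrative
  }
  JWT.encode(payload, SIGNING_SECRET, "HS256")
end

# Verify the signature and return the payload on each authenticated request.
def decode_session_token(token)
  JWT.decode(token, SIGNING_SECRET, true, { algorithm: "HS256" }).first
end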

Ops users are required to log in via Okta SSO, which enforces both a valid username/password and a two-factor authentication (2FA) prompt. We use Devise and Omniauth to wire up the OIDC flow with our upstream identity provider, and we use CanCan to expand each user's set of roles into a set of capabilities that restrict which resources they can access or modify.
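
As a rough sketch of that roles-to-capabilities expansion, a CanCan ability class looks something like the following; the role names and resource models here are hypothetical stand-ins, not our actual permission set:

# Hypothetical ability definition; real roles and resources differ.
class Ability
  include CanCan::Ability

  def initialize(user)
    return if user.nil?

    if user.roles.include?("pharmacist")
      can :read, Prescription
      can :update, Prescription
    end

    if user.roles.include?("care_team")
      can :read, SupportTicket
      can :manage, Shipment
    end
  end
end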

Ops users primarily work in our internal pharmacy management app. However, the state of these users was shared between several different tables: Users, which otherwise refers to customers, and Employees, which holds some ops user-specific state tied to Users.

So… what did we do?

Large change sets deployed at once are harder to code review, harder to unwind if something goes wrong, and more likely to be disruptive. For this project, we knew that such a large architectural change had to be broken up into many more digestible pieces if we were to be successful. The plan:

1. Create a new dedicated model for our ops users

This was the easy part: create the database structure for a properly isolated admin user type, which we called a WunderbarUser.
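
In spirit, that migration looked something like this; treat the column list as illustrative (it mirrors the attributes we sync below) rather than our literal schema:

# Illustrative Rails migration for the new, isolated ops user table.
class CreateWunderbarUsers < ActiveRecord::Migration[6.1]
  def change
    create_table :wunderbar_users do |t|
      t.references :user, foreign_key: true # link back to the legacy User record
      t.string :email, null: false, index: { unique: true }
      t.integer :roles_mask
      t.date :date_of_birth
      t.datetime :token_last_invalidated_at
      t.string :first_name
      t.string :last_name
      t.string :preferred_first_name
      t.boolean :is_active, null: false, default: false
      t.timestamps
    end
  end
end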

We now had our admin user state split between Employees and Users, plus a new table that engineers and analysts wanted to use but that was not yet populated, and we had to make progress while allowing new hires and new customers alike to onboard to our platform.

2. Synchronize state between the old tables and new table

Next, we created a WunderbarUser for any User who was, or ever had been, a Wunderbar user. This required some historical digging: we had a lot of people who were onboarded (and some who were offboarded) before we even had the intermediate Employee record!

Two major choices at this stage really helped the project succeed. First, we ensured that the WunderbarUser records all had IDs matching the corresponding User records. This is because a lot of our legacy audit records (which we’re required to preserve and keep useful as a Covered Entity under HIPAA) referenced an `admin_user_id` column populated with the ID of the User. If we had simply created all-new entries for WunderbarUsers, these old logs would have required at least one level of indirection (e.g. querying back to User) to remain useful.

# Create with the same ID as the user that we're creating this from.
wunderbar_user_attributes = {
  id: user.id,
  user_id: user.id,
  email: user.email,
  roles_mask: user.roles_mask,
  date_of_birth: user.date_of_birth,
  token_last_invalidated_at: user.token_last_invalidated_at,
  first_name: user.first_name,
  last_name: user.last_name,
  preferred_first_name: user.preferred_first_name,
  is_active: false,
}
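
Those attributes then fed a one-time backfill along these lines; this is a trimmed sketch (the admin-detection condition and batching are illustrative, and the real task had more guardrails):

# Illustrative backfill: create a WunderbarUser for every historical admin User.
User.where("roles_mask > 0").find_each do |user|
  next if WunderbarUser.exists?(id: user.id) # keep re-runs idempotent

  WunderbarUser.create!(
    id: user.id, # preserve the ID so legacy audit logs still resolve
    user_id: user.id,
    email: user.email,
    roles_mask: user.roles_mask,
    is_active: false
  )
end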

Second, we created a new WunderbarUser any time we created a new Employee record (e.g. when we onboarded a new employee) and synchronized any changes between the Users, Employees, and WunderbarUsers tables. As a side note: we knew we’d have to protect this syncing logic to ensure that we only copied each change once; otherwise, our callbacks would fire forever!
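
Here is a sketch of that guarded syncing logic, with the attribute list trimmed and the names illustrative; the important part is the change check that keeps the callbacks from ping-ponging:

# Illustrative sync callback on the legacy User model.
class User < ApplicationRecord
  after_save :sync_to_wunderbar_user, if: :wunderbar_user_present?

  private

  def wunderbar_user_present?
    WunderbarUser.exists?(id: id)
  end

  def sync_to_wunderbar_user
    wunderbar_user = WunderbarUser.find(id)
    attrs = { email: email, roles_mask: roles_mask, first_name: first_name, last_name: last_name }

    # Only write when something actually changed, so the mirror-image callback
    # on WunderbarUser finds nothing new to copy back and the loop terminates.
    wunderbar_user.update!(attrs) unless attrs.all? { |key, value| wunderbar_user[key] == value }
  end
end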

3. Migrate functionality slowly to use the new ops user model

We migrated logic to reference WunderbarUsers where previously we’d referenced Users and/or Employees. Authentication, authorization, audit logs, and pharmacist licenses were all updated here — albeit painstakingly!
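
In practice, much of this was repointing associations and lookups model by model. For example (the model below is hypothetical, but the `admin_user_id` column is the one described above):

# Before: audit rows resolved against the overloaded User model.
class AuditLogEntry < ApplicationRecord
  belongs_to :admin_user, class_name: "User", foreign_key: :admin_user_id
end

# After: the same column now resolves against the dedicated ops user model.
class AuditLogEntry < ApplicationRecord
  belongs_to :admin_user, class_name: "WunderbarUser", foreign_key: :admin_user_id
end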

We then migrated all of the foreign keys that our various Postgres tables held against either the Users or Employees tables to instead reference WunderbarUsers. This ended up being around a hundred foreign keys, but it was a constraint-only migration that fortunately didn’t involve any data changes.

In short: our goal was to do this change slowly and steadily enough that nobody would notice until, one day, we could announce to the team, “Hey, we did a big security-pertinent project, and now you can change your customer email to be anything.”

Impact

  • Authentication: While we use the same JWT format for each, our customer and operations user authentication systems are now totally separated.

  • Authorization: With separate user types, we have separate capabilities and domain-specific authorization checks for each area, enforced in code.

  • Experience: Onboarding as a new Altoid (somebody who works here; come join us!) is totally separate from the customer experience, and you no longer need to link your accounts together.

  • Privacy: Our ops users who choose to be customers enjoy the full privacy afforded to our customers without tying their account to their work identity in any way.

  • Attribution: Legacy audit logs are still index-able and query-able with the same IDs, just in a slightly different table.

  • State management: Customer state lives on the User table, and ops user state lives on the new WunderbarUsers table. Our callbacks on each have been dramatically pared down, and our intermediary Employee table is entirely gone, along with a number of other now-ancillary tables.

  • Architecture: We have a clear path to federate RBAC checks without shipping our entire customer database between services.

  • Bonus: Performance and code readability generally improved by removing complex joins and join tables and isolating concerns between the customer and ops user domains.

  • Another Bonus: Significant refactorings and increased test coverage along the way while we cleaned up some very old, very stale code.

Large architectural changes of legacy systems with heavy use are rarely 100% smooth. Here are a few learnings to keep in mind the next time you tackle a hefty piece of technical debt. 

1. Always validate your assumptions about the data your constraints run against when foreign keys are concerned. Multiple times we assumed, “Hey, we’ve migrated a bunch of really big, really old tables (insurance claims, prescriptions, and more), so there’s no way we could have another admin user that we haven’t created a commensurate WunderbarUser for, right?” Wrong! Your data can always have inconsistencies, but a simple SQL query (e.g. confirming that every value in the foreign-key column you’re migrating has a matching WunderbarUser) can save a lot of time and headaches.
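
The check itself can be a single query run before each constraint migration; something like this sketch (using the same illustrative table and column as the migration below):

# Illustrative pre-flight check: find foreign-key values with no matching WunderbarUser.
orphaned_ids = ActiveRecord::Base.connection.select_values(<<~SQL)
  SELECT DISTINCT inventory_orders.created_by_id
  FROM inventory_orders
  LEFT JOIN wunderbar_users ON wunderbar_users.id = inventory_orders.created_by_id
  WHERE inventory_orders.created_by_id IS NOT NULL
    AND wunderbar_users.id IS NULL
SQL

raise "Backfill missing for IDs: #{orphaned_ids.join(', ')}" if orphaned_ids.any?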

2. When migrating foreign key constraints on tables with active reads in Postgres, break the change into two steps. First, migrate your foreign key with `{ validate: false }`:

def up
  remove_foreign_key :inventory_orders, to_table: :users, column: :created_by_id
  add_foreign_key :inventory_orders, :wunderbar_users, column: :created_by_id, validate: false
  validate_constraint
end

Then validate the named constraint of your foreign key separately:

def validate_constraint
  validate_constraint_statement = 'ALTER TABLE "inventory_orders" VALIDATE CONSTRAINT "fk_rails_abcd1234";'
  ActiveRecord::Base.connection.execute(validate_constraint_statement)
end
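
(As an aside: newer Rails versions with Postgres also ship a `validate_foreign_key` migration helper that can stand in for the raw `ALTER TABLE ... VALIDATE CONSTRAINT` statement; either way, the important part is keeping validation in its own step so adding the constraint doesn't hold a blocking lock while every existing row is checked.)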

3. When working with legacy tables, be mindful of legacy systems. Our Billing service used a very outdated golang ORM that panicked if a column specified by a `struct` in source no longer existed; this caused a late-night outage that left auto billing requests backed up the next day.

4. Scoping projects is hard! This was a larger project than our team had ever scoped and delivered before, and once all the dust settled, it took several months longer than our initial estimates. The project was worth doing, but better estimates would have let us prioritize more effectively; if we’d had an accurate scope of this immense project upfront, would we have chosen to do it first? It was also a tremendously difficult project to separate into material stages, so nearly all of the benefit was backloaded.

In the end, this was a very satisfying and important project to complete. We wouldn’t have been able to accomplish it without a lot of help from our colleagues on the Platform team, the Security team, and across many other functions at Alto. Very large projects that involve a lot of moving pieces and impact other teams are always scary, but we can’t let that fear prevent us from making the changes we know we need to make in order to be successful long-term.

The Engineering Team is hiring! Learn more about open positions here.