
Hightouch vs. Segment: How we do reverse ETL at Reforge

How we use Reverse ETL

We use reverse ETL here at Reforge for a bunch of different use cases. They are:

  • Sending data about our customers to our email automation system, Iterable
  • Sending data that is aggregated in our data warehouse (Snowflake) to a Google Sheet
  • Sending data to our behavioral analytics system, Amplitude
  • Sending data to our CRM (HubSpot) from our data warehouse
  • Sending data to Slack from our data warehouse

Examples

We use Segment and its Personas feature to send data to our email system, Iterable.

There are cool features in Personas that allow you to use a GUI to define things like:

  • Counters (example: how many page views meet certain criteria in the last X days)
  • Aggregation (example: how much revenue have they generated in the past quarter)
  • Most frequent (example: what’s the most frequent blog post they’ve visited)
  • First (example: what’s the first marketing campaign that got a user to visit your site)
  • Last (example: what’s the last marketing campaign that got a user to visit your site)

You can also write arbitrary SQL to pull a list of people, and any column in the result set can become a trait associated with a user. So you can use the GUI to define basic things or use complicated SQL to pull sophisticated stats.

An example of a user trait we have in Segment:

A SQL user trait in Segment that pulls a list of applicants to Reforge and values associated with their application

Important caveat: it’s on our roadmap to pull all of this logic out of raw SQL and put it into a dbt model. The benefits there are better testing, understanding of data lineage in a DAG, and keeping all of it under source control. In the future the SQL for this trait could be as simple as a select from applications.user_application_values.
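
For readers who haven’t seen a SQL trait before, here’s a minimal sketch of the shape of the query in the screenshot above. The table and column names (applications, users, program_name, status) are hypothetical stand-ins for our real warehouse tables; the key idea is that the query returns one row per user, the email (or user ID) column is used to match the person, and every other column becomes a trait:

select
    u.email,                                   -- Segment matches the person on this column
    a.program_name as application_program,     -- becomes a trait
    a.status as application_status,            -- becomes a trait
    a.submitted_at as application_submitted_at -- becomes a trait
from applications a
    inner join users u
        on a.user_id = u.id
where a.submitted_at > current_date - interval '1 year'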

In our data dictionary we define these properties:

Screenshot of our data dictionary covering the same fields

Then we can use the traits to send personalized campaigns in Iterable for many different scenarios:

  • Application acceptance or denial
  • Payment nudges to remind someone they’ve been accepted
  • Stats about how many people from their company have participated

Here’s an example of a campaign in Iterable:

Sample campaign in Iterable that uses the fields from the SQL trait / data dictionary

You can see how we have a dynamic template in the email above that personalizes the email subject with the company name they selected, and includes stats about the company like:

  • How many people from their company have participated in Reforge in the past
  • How many people from their company are current members
  • How many people from their company have applied recently

I think it’s pretty great that Segment has this functionality, especially when you combine it with their other platform tools like Protocols. That said, I have some gripes about the tool as of November 2022:

  • On our current tier, traits can run at most once an hour. Maybe I could pay for a higher frequency, but that drives me nuts and I’d have to negotiate with a sales rep for it.
  • It’s really hard to test that a trait is working the way you want because you can’t manually trigger a sync. This drives me nuts because I always make a simple mistake and then either try to coax it to trigger or have to wait an hour to see if my update fixed it.
  • The destinations you can sync to are the typical ones for a CDP. I wish I could sync a result set to a spreadsheet, but that’s not possible.
  • I can’t sync non-contact objects to our CRM. For instance, I can sync contacts and their properties to HubSpot but I can’t sync subscriptions to a custom object or companies and their details to a company object in HubSpot.
  • It feels like Segment has abandoned working on this feature. They’re busy integrating with Twilio and building things like Twilio Engage rather than investing in making Personas great. I might have missed it, but it feels like they haven’t made changes to the product in years.

How we use Hightouch

We use a second reverse ETL provider at Reforge. It sucks that we have two, but my hope is that this is a transitional period until we ultimately consolidate on one. I started using Hightouch for free to sync data to a Google Sheet and to Slack, and it became clear it was the vastly superior product to Segment Personas.

The world runs on spreadsheets, and it’s often really helpful to give teams a constantly updated resource with data synced from your data warehouse. It’s a great free path into their tools because once you have it connected, it’s so easy to add additional sources.

Here’s an example of how we send paid data into a spreadsheet for the marketing team (we pull paid numbers across all channels into our data warehouse and this helps them understand all of it and play with it in a spreadsheet environment rather than pulling it from a BI tool constantly):

Sending aggregate paid spend numbers into a spreadsheet for the marketing team

The model can be custom SQL, a table in a DB, or a model in dbt or Looker. It makes it really easy.
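
As a rough sketch, the custom SQL behind the paid-spend sync above looks something like this (ad_spend and its columns are hypothetical stand-ins for the paid data we land in the warehouse):

select
    spend_date,
    channel,                          -- e.g. Google, Facebook, LinkedIn
    sum(spend_usd) as total_spend,
    sum(clicks) as total_clicks
from ad_spend
group by 1, 2
order by 1 desc, 2

Hightouch re-runs the model on a schedule and keeps the sheet in sync with the result set.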

We also use Hightouch to sync data to our CRM, HubSpot. It’s just so easy.

We also use Hightouch to send customized messages to Slack. I’ve done this using Zapier in the past but it’s nice to be able to have all of these syncs in one place. Here’s an example of a daily ARR report that the team built:

Our daily update that sends data from Snowflake to Slack

Here’s what it looks like in Slack:

An example of a daily summary the team built via Hightouch
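
The model feeding a Slack sync like this can be just a few lines of SQL. A sketch, with a hypothetical subscriptions table standing in for our real revenue models:

select
    current_date as report_date,
    count(distinct account_id) as active_accounts,
    sum(annual_price) as arr
from subscriptions
where status = 'active'

Hightouch then drops those columns into the message template that gets posted to the channel.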

I spoke with the co-founders of Hightouch in the early days, and that’s when I found out they were part of the early team that built Personas at Segment. It feels like they were able to address a lot of their frustrations with how the product was built and make big improvements on it. Things that are just so much better than Segment:

  • Their support is so much better than Segment’s. This should be every startup’s advantage against a company acquired by a big behemoth, and they do a good job. I have been so disappointed with Segment’s support over the years, and Hightouch makes the overall experience so much better because you can quickly get an answer to why something is broken. With Segment support, I oftentimes have to wait days, which is so painful when you’re working on a deeply technical integration like this.
  • They have a “run now” button where you can run a sync anytime you want. This sounds so simple, but I’m sure it makes the technical implementation much harder. I can’t tell you how many times I’ve messed up a piece of a sync in Segment Personas and then I just have to wait until the sync runs again. It baffles me that they haven’t changed this.
  • It feels like it’s being worked on all the time. Lots of changes and improvements, new features, and integrations with the latest and greatest tools.
  • The logging and insight into what happened are so much better with Hightouch. This is a real interaction I had with Segment’s support team (no judgement on the support rep, this is a reflection of the state of their product). This type of response is infuriating when you are working to integrate two systems and you don’t know what’s wrong or when it’ll retry some logic.
A screenshot of my support ticket with Segment. I was very frustrated!

Conclusion

I look forward to moving all of our reverse ETL infrastructure over to Hightouch from Segment at annual contract renewal time. While there are benefits to having the reverse ETL logic and elements all in Segment (we pay for their Protocols feature, and there are benefits in blocking bad data / being alerted when your data doesn’t adhere to the spec), I would rather use the better tool.

How do others do this type of work at their companies? I’d love to know if others solve their challenges similarly and if there are even better ways.

Alerting on Snowflake Cost Spikes with Metaplane

We switched from a Postgres data warehouse to a Snowflake data warehouse earlier this year. As part of that transition, we spoke with sales and learned what drives your bill on Snowflake. It turned out that 15-20% of our Snowflake bill was due to our BI tool, Metabase.

That led me to wonder if I could better understand what was driving our costs on Snowflake and ensure that we didn’t get an unexpected bill. Our storage costs are tiny (we’re lucky at Reforge that we have enough data to make it useful, but not so much that the costs are prohibitive / make us move slowly).

I did two things as a result of this:

  1. Visualized the costs over time.
  2. Set up Metaplane to monitor our daily usage and alert us if it deviated from the historical norm. This was really easy.

Visualizing Costs over Time

I got some great feedback from the dbt community about the types of queries I should run to track down costs. They asked good questions like:

  • How are your warehouses configured?
  • What warehouses are creating the most spend?
  • Are the warehouses set to auto-suspend?
  • What services are contributing to your bill?
  • Is your spend mostly on storage or compute?
  • What queries cost the most?

Ultimately, one of the things that came out of the analysis was that essentially all of our bill was going to compute rather than storage. The sales team at Snowflake said that most customers don’t pay for “cloud services” and that we were an outlier. That led us to look into what was driving the overage on our cloud services. I visualized the cost over time with Deepnote for both compute and cloud services. You can see that “overage-cloud services” represents a big part of our bill, and we attributed most of that cost to the Metabase queries.

Our monthly Snowflake bill broken out by usage type

We discovered the issue in April and took some action to try to reduce it. If you want to do the same analysis, I posted the notebook here. I’d love to build a flowchart of the types of things you should be looking at when analyzing your Snowflake bill and add it to the Jupyter notebook.
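
If you just want the numbers behind a chart like the one above without opening the notebook, the core aggregation is a small query against the same organization usage view used in the next section (assuming your role can read the SNOWFLAKE.ORGANIZATION_USAGE share):

select
    date_trunc('month', usage_date) as usage_month,
    usage_type,                       -- e.g. compute, overage-cloud services
    sum(usage_in_currency) as cost
from SNOWFLAKE.ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
group by 1, 2
order by 1, 2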

Being Alerted to When Costs Go Haywire

Metaplane is a tool that looks at your data systems and lets you know when something is broken. It could be that a table has stopped being updated, isn’t being updated at the same rate, or that someone is spamming events to your system. It has many different kinds of tests, but ultimately it helps ensure you know when things are going haywire so you can fix them before they cause problems or cause others to lose confidence in the reliability of your data and systems.

We configured Metaplane to run this query every day for both cloud services and compute spend, which make up the vast majority of our spend. If the result deviates from the historical norm, it alerts the team so we can dig into what is increasing our usage and our bill. A big thanks to Ian Whitestone in the dbt community, who helped with suggestions on this query as well as other ideas.

-- yesterday's cloud services overage, in currency
select
    sum(usage_in_currency) as total_cost
from SNOWFLAKE.ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
where usage_type = 'overage-cloud services'
    and usage_date >= current_date - interval '1 day'
    and usage_date < current_date
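
We monitor compute spend the same way; the only change is the usage_type filter (the exact usage_type strings can vary, so it’s worth checking what a select distinct usage_type returns in your own account):

select
    sum(usage_in_currency) as total_cost
from SNOWFLAKE.ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
where usage_type = 'compute'          -- assumed label; verify against your own usage data
    and usage_date >= current_date - interval '1 day'
    and usage_date < current_date
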
Metaplane monitoring compute credits over time, our usage spiked in the last data point to 2X the normal range
The slack notification when spend deviates from what we expect

Example Use

In the above image you can see how there was a big spike in usage that was much higher than the previous sixty days. Our team was notified, looked into the issue, and this was the result:

found two queries made in the Snowflake UI that had been running for 20 and 40 hours respectively. I force canceled the queries and let the user know what happened. We also decided to update the query timeout account-wide from its default value of 48 hours down to 1 hour.

One of the data engineers on the team
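
For anyone who wants to make the same change, the timeout the engineer mentions is a single account-level parameter in Snowflake (run as ACCOUNTADMIN; the default of 172800 seconds is the 48 hours referenced above):

-- lower the account-wide statement timeout from 48 hours to 1 hour
alter account set statement_timeout_in_seconds = 3600;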

This is a great example of how Metaplane ensures that we’re on top of anything unexpected that can affect our team. It may be data quality issues, reporting issues, delays, logic issues, or something like costs. Examples like this confirm that data observability is a must-have that all companies will adopt in the next couple of years.

Next Steps

Ultimately, we didn’t think it was worth our team’s time to try to optimize the Snowflake spend driven by our Metabase issues. I’d rather the team spend time creating new value for our company and our customers than spend a lot of time to save $1,000 / month. There will come a point where it’s worth it for someone on our team to look at this and a myriad of other areas for improvement, but until then we’ll monitor our spend closely to ensure it doesn’t change materially from where it is today. If anyone has any tips or insight into how to address the Metabase problems, let me know!

Analyzing Metabase Usage

I am the head of data at Reforge (we’re hiring!) and data is critical to our success. We use it to track our business, individual teams leverage it in their work, and we believe in empowering people in the company with data to help them do great work. We use all sorts of great tools, but as the head of data I felt it was important to understand whether people were actually using our BI (business intelligence) tool.

I built a weekly report that lays out how my company is using our BI tool, Metabase. It has the following reports:

  • Dashboards created in the past 7 days
  • Questions created in the past 7 days
  • Individual Dashboard views by our exec team in the past 7 days
  • Dashboard view counts by user in the past 7 days
  • Activity by my boss (our CTO)
  • Metabase MAUs
  • Metabase WAUs

It helps me to understand at a high level what is happening inside the company from a data perspective:

  • Who is building reports (questions / dashboards)
  • What questions people are asking of our data
  • What is most important to our exec team
  • Who is looking at information on a regular basis
  • What is top of mind for my boss
  • How engaged our company is with our data & analytics

It was surprisingly easy to set this up. All we had to do was add the Metabase internal database as a source to Metabase (sounds confusing, but actually quite simple). It wouldn’t shock me if this became a paid feature long-term for Metabase on their enterprise plan, but it took less than 30 minutes to figure out their data model and build some quick and dirty SQL questions that produced answers to our most important questions. Metabase is already storing all of the data we needed to visualize to get answers to our questions.

I set this up as a dashboard subscription so it is delivered to my email every week and I can quickly scan it to understand what’s top of mind for the org.

Reforge is Data Driven

Since the beginning of the company we’ve looked at data to understand how well we’re tracking against key goals for the company (revenue, applications, content completion, weekly activity, etc). These stats are useful across the entire company (support, sales, marketing, product, engineering, design, finance, etc), and we strive to make as much information as possible accessible to everyone so that they can be data informed.

If you’re considering joining a company – do you want to be at one that uses data to help inform decisions? You can see below that Reforge employees are using data on a regular basis and you can see how it scales as we’ve grown the team and hit key milestones. We’ve typically seen 75-100% of employees leveraging Metabase on a monthly basis.

Our Metabase monthly active users along with company milestones

The queries that we use to power the charts:

Dashboards created in the past week

select created_at, name, u.email, left(description, 40) as description, 'https://BI_HOSTNAME/dashboard/' || d.id as url
from report_dashboard d
    inner join core_user u 
        on d.creator_id = u.id
where created_at > current_date - interval '7 days'
order by 1 desc

Questions created in the past week

select created_at, name, u.email, 'https://BI_HOSTNAME/question/' || d.id as url
from report_card d
    inner join core_user u 
        on d.creator_id = u.id
where created_at > current_date - interval '7 days'
order by 1 desc

Exec team dashboard views in the past week

For this report, I wanted to know what the exec team was looking at in the past week. I want to know how much they’re looking at the metrics and whether they’re looking at the dashboards I think they should be looking at.

select u.email, d.name, count(v.id)
from view_log v
    inner join core_user u
        on v.user_id = u.id
    inner join report_dashboard d
        on v.model_id = d.id
where model = 'dashboard'
    and u.email in 
        (LIST_OF_EMAILS)
    and v.timestamp > (current_date - interval '7 days')
group by 1, 2
order by 3 desc
limit 200

Dashboard views leaderboard (views in the past week)

This shows the number of views by user in the past 7 days. It’s always a good sign when your head of data isn’t the most prolific viewer of dashboards.

List of users and their views

select u.email, count(v.id)
from view_log v
    inner join core_user u
        on v.user_id = u.id
    inner join report_dashboard d
        on v.model_id = d.id
where model = 'dashboard'
    and v.timestamp > (current_date - interval '7 days')
group by 1
order by 2 desc
limit 200

Most popular dashboards by views

select d.name, count(v.id)
from view_log v
    inner join core_user u
        on v.user_id = u.id
    inner join report_dashboard d
        on v.model_id = d.id
where model = 'dashboard'
    and v.timestamp > (current_date - interval '7 days')
group by 1
order by 2 desc
limit 200

Questions created by a single person

select created_at, name, u.email, 'https://BI_URL/question/' || d.id as url
from report_card d
    inner join core_user u 
        on d.creator_id = u.id
where created_at > current_date - interval '28 days'
    and u.email = EMAIL_OF_YOUR_BOSS
order by 1 desc

Metabase MAUs

select date_trunc('month', timestamp), count(distinct user_id) 
from view_log 
group by 1
order by 1

Metabase WAUs

select date_trunc('week', timestamp), count(distinct user_id) 
from view_log 
group by 1
order by 1

Being the closest to the customer

If you read about the strategy of successful tech companies today, it’s all about having “obsessive customer focus” (Jeff Bezos’ 2016 Amazon annual shareholder letter). You’ll hear that “whoever gets closest to the customer wins” (Drift), and that companies want to “solve for the customer” (HubSpot culture code). Ultimately, the question isn’t whether you’re focused on your customers, but how you go about evaluating who they are, what they’re trying to accomplish, and how they’re interacting with your product. I recently started using a workflow that gives me a continual stream of feedback, allows me to go back and forth to dig deeper and clarify important questions, and makes it easy to share the results with my team. It required no effort from our engineering team to set up and didn’t require additional budget.

I work at Reforge, and we’re an education company. We offer programs for those located in SF, but we also have an equivalent online-only experience. For a bunch of our key initiatives this year, I felt that I didn’t fully understand what brought people back to our web app after some time away, and I wanted to dive deeper to improve it. Rather than do a one-off survey, I set up a campaign that runs continuously to deliver this feedback on a daily basis.

Using Segment to create our list of “alumni” to survey

When I arrived at Reforge we were already using Segment, and we ended up buying their Personas add-on. I’m a relatively happy customer (and happy they just raised $175 million), but I used it mostly because we were already paying for it. I’m a big fan of using the tools available to you.

Segment has a feature called Audiences that lets you create lists of people. Since most of our important data attributes and events were already flowing through Segment, it was very easy to define small segments or large swathes of our user base through a simple editor. While I live and breathe SQL, sometimes it’s really nice to build it out in a GUI. Here’s what I built:

The nice thing about this is that when someone enters this audience, it means that they’re an alum, they’re not in our most recent cohort, and they have viewed our online material. The cool thing is that you could specify anything about the users that you have available (role, country, seniority, depth of engagement, type of user, organization, etc).

A feature of Audiences is that you know when someone enters the audience. The way I’ve structured the audience, this means that they came back to our site and it has been more than 90 days since their last visit – otherwise they’d already have been a member. So it’s a cool way to know when someone has come back.
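
For those who prefer to see it in SQL, the audience above is roughly equivalent to the sketch below (users and page_views are hypothetical tables; the real definition lives entirely in the Audiences GUI, which is also what detects the moment someone enters the audience):

select u.user_id, u.email
from users u
where u.is_alum                                   -- completed a past cohort
    and not u.in_most_recent_cohort               -- but isn't in the current one
    and exists (
        select 1
        from page_views pv
        where pv.user_id = u.user_id
            and pv.page_type = 'online_material'  -- has viewed our online material
    )

The part that’s genuinely hard to express in a single query (someone re-entering this population after more than 90 days away) is exactly what Audiences gives you for free.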

Send the list of people to Zapier

Segment then allows you to send this information to any of their supported destinations. I sent it to Zapier:

Segment tells Zapier that the user has come back to our site, and Zapier then writes it to a Google spreadsheet. This is what ours looks like:

What you can see here is that a user from HubSpot named Kieran has revisited how to build a qualitative growth model. I am also using a Segment API to pull in their first name and the title of the last page they visited, in case I want to include that in my outreach asking for feedback. Segment is continuously sending data about people coming back to Reforge, and each time it happens Zapier is writing it to a Google Spreadsheet for me.

Email the people revisiting the material

I then have another Zap that takes the rows from this spreadsheet and emails the person from my personal G Suite account.

You can see what it looks like if I look in my sent folder within gmail (and you can see that people have replied to the email from Gmail’s threads):

Via this workflow, I’m automatically emailing people who are coming back to our site, asking them what brought them back. I love getting a continual stream of this feedback. Because it’s in Gmail, I can go back and forth with them to clarify what they mean and to dive deeper to understand. If you have a huge user base, you can easily filter down the number of people you email with another step in Zapier (mod their user id by a number to make sure you don’t email too many people at once).

Collect all of the feedback in a Google Doc

At this point I’m automatically emailing everyone from the segment I care about, and I’m able to go back and forth to clarify any questions that I have or dig deeper. Rather than copy and paste their responses into a Google Doc to share snippets of feedback / soundbites, I hooked up another Zap to automatically pull in their feedback and put it in a Google Doc.

When the feedback emails come in, I set up a Gmail filter to automatically apply a label to the email. I use Zapier to look for new emails under that label, and I exclude any emails from me (my replies to them). Zapier then adds each email as a row in a Google Sheet:

Then I used a couple of simple formulas to combine all of the emails from a single user into a single row in another spreadsheet:

I separate the replies with a ———, so the above row represents multiple emails back and forth with this person. You can see that my first question was about use cases, and the third one was about the ones that come up frequently.

Column A is set to be =UNIQUE(Emails). That means there will only be one row for each email address I have feedback from. The formula for column B is:

=ARRAYFORMULA(TEXTJOIN(CHAR(10) & "——–" & CHAR(10), TRUE, IF(Emails=A2, Response, "")))

Array formulas are really cool; I’ve only needed them a couple of times, but I’m always impressed with their functionality. Basically, this formula tells Google Sheets to combine all of the emails from a single user (remember, each row is one user) with the “——–” separator.

This is pretty powerful. Now I have a spreadsheet that has the entire conversation with someone in a single spreadsheet row, and I can share that spreadsheet with my entire team. We can then add columns to categorize feedback into buckets for easy filtering / reading.

Why I love this approach:

  • I get to read feedback from a critically important segment of users every day. I can define multiple segments to run simultaneously. The only limit is the limit on emails I can send from my Google account (2,000 messages per day), and the number of emails I have time to respond to.
  • I get to follow up with them in my main email tool
  • Their feedback then gets pulled automatically into a spreadsheet that I can share with my team members, where it can be categorized and filtered.
  • While this isn’t the easiest thing to put together, it’s so much easier than it used to be: writing complex SQL by hand, setting up a cron job to run it, writing a custom Gmail script, and then storing the information in some database / Google Sheet. If you have read this far and are thinking about building this – let me know, I’d love to try it out first.

Is there an easier way to accomplish this? Let me know, I’d love to switch to it.
