A bit about the "scraper web service" freelance project
In the last post I promised to publish some notes about the freelance project I was working on. The task was to deploy a scraper as a web service with a REST interface, and in the end there were 5 docker containers:
... -> nginx -> webrick -> sidekiq -> scraper
                  v   v      v   v
                  v   invoker.rb redis
                  v       v
                  sqlite.db
At first I wanted to use Yandex Cloud Message Queue, but the API consumer wanted to have a view of the queued tasks, which is impossible (by design?) in case of YMQ. The sad part of the story is that the "sidekiq queue" is actually a thing scattered across several Redis structures of different kinds, so one can't just "check the queue" -- it's kind of undocumented and seems to lead to race conditions.
The invoker.rb
is a source file that is required by both sidekiq and the webrick server that creates the jobs -- personally I don't like it, but that's by Sidekiq design. Or am I missing something?
The sqlite.rb
is required by both services too, because the job stores (writes) the results obtained from the "scraper" container and then Webrick reads them.
The docker images used are 2x"ruby:2.7-alpine", 2x"redis:alpine" and one "nginx:stable-alpine".
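For reference, a hypothetical docker-compose.yml tying those five containers together (service names, ports and the rest of the details are my assumptions, not the project's actual config):

```yaml
services:
  nginx:
    image: nginx:stable-alpine
    ports: ["80:80"]
    depends_on: [webrick]
  webrick:
    image: ruby:2.7-alpine
    init: true                 # reaps the zombie chromium processes
    environment: {PORT: "8080"}
  sidekiq:
    image: ruby:2.7-alpine
    depends_on: [redis]
  redis:
    image: redis:alpine
  scraper:
    image: redis:alpine        # per the image count above
```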
Now the random notes:
- According to the docs, the
https://oauth.vk.com/authorize?client_id=<client_id>&display=page&response_type=code&v=5.131&scope=offline
VK API endpoint should return a "code" with which you are supposed to obtain the token, but weirdly it returns the (eternal?) token directly.
- I usually spawn Chrome for Ferrum automation on my macOS and had never tried to host a service that would live 24/7 and spawn the chromium on demand. After a few hours the docker container stopped working, emitting various errors, and the reason was that the chromium processes spawned by webrick were becoming zombies. I reported the issue here, but as you can see, I quickly figured out the fix myself: use the
init: true
docker-compose option.
- And just to make this post include some interesting code, here is a webrick base I used for two containers:
require "webrick"

WEBrick::HTTPResponse.class_eval do
  def create_error_page
    @body = "#{self.status} #{@reason_phrase}"
  end
end

SERVER = WEBrick::HTTPServer.new Port: ENV.fetch("PORT").to_i
END{ SERVER.start }
You require it, define the routes via SERVER.mount_proc, and the END block starts the server once the requiring file finishes loading.
And now the tool compose-launcher. What is it for? I made it when I had a several-years-old server that hosted multiple applications, each with its own docker-compose config. compose-launcher is a tool for such cases: it [re]deploys multiple unrelated applications.
---
- :dir: scraper
  :repo: nakilon/my-repo
  :branch: dev
  :cd: dir-with-compose-config
  :compose: special
  :pre: touch consumer/sqlite.db
- :dir: another_project
  ...
When you run the tool with the above example config, it:
- creates the directory "#{Dir.home}/_/.compose-launcher/#{service.fetch :dir}"
- clones and pulls the GitHub repo "nakilon/my-repo"
- optionally switches to a branch ("dev" in this case)
- optionally cds to a specific directory, and optionally prepends a prefix ("special" in this case) to docker-compose.yml
- optionally executes the pre command before running
- runs docker-compose and ensures the containers are running
- it may also pass the env variables (omitted in the above example)
What else? Oh yeah, I've added RSS and email subscription options to this blog ..) they are available on the Home page.