Infrastructure update

Everything started on Heroku in October 2012, on their dynos with Heroku Postgres, and continued on OpenShift in August 2013 on a LAMP stack based on Apache 2.4, PHP 5.3 and MySQL 5.1.

Now it’s time to move my little blog to a modern stack. The best offering on OpenShift is a variation of the standard LEMP stack (we can call it LEMP-HH) with HHVM 3.8 and MariaDB 5.5 over NGINX 1.7.

[Image: the LEMP-HH stack]

Actually, the biggest performance improvement was achieved by adding a good cache plugin a few months ago. I have always used W3 Total Cache and WP Super Cache but, in this specific case, they are both complex to set up because of the structure of the OpenShift stack. The best solution I found is the WP Fastest Cache plugin, one of the latest cache plugins I tested. Here is the stunning header of their website showing two beautiful cheetahs (are they cheetahs?).

[Image: WP Fastest Cache website header]

Anyway, coming back to the new stack: there is no official bundle yet, but you can create a new application using tengyifei’s HHVM 3.8 cartridge and adding the OpenShift MariaDB 5.5 cartridge. I wasn’t able to run them on different gears (with the scaling option activated and HAProxy), but it seems fast enough on a single gear.
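Something like this should work with the rhc client; a rough sketch, where the app name and the cartridge manifest URL are illustrative, not the real ones:

# create the app from the downloadable HHVM cartridge manifest (URL illustrative),
# then add the MariaDB cartridge (use the name listed by `rhc cartridges`)
rhc app create myblog https://example.com/hhvm-3.8/metadata/manifest.yml
rhc cartridge add mariadb-5.5 -a myblog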

The filesystem structure is similar to the standard PHP bundle, except that the application dir is named www/ instead of php/. I used the latest backup from UpdraftPlus to migrate the database to MariaDB. On non-scalable applications you need to forward ports in order to access the DB from your local machine. The RHC command is:

rhc port-forward -a application-name

Source here: Getting Started with Port Forwarding on OpenShift

Moving to NGINX also breaks permalinks because .htaccess no longer works. The Nginx Helper plugin fixes the problem, but you can simply add a few lines to the NGINX configuration located in /config/nginx.d/default.conf.erb:

# Handle any other URI
location / {
    try_files $uri $uri/ /index.php?q=$request_uri;
}

Discussion on WordPress support forum: WordPress Permalinks on NGINX

Refactoring the previous filesystem, migrating the database and fixing permalinks and other stuff took about 2 hours and, at the end, everything seems to be working fine. I’m quite confident this is a future-proof solution, but I’m going to keep testing it until the next major update 🙂

[UPDATE 2015-09-06 21:56 CEST]

After the migration, sitemap_index.xml and robots.txt weren’t reachable: some rules were missing. I took the opportunity to switch to Yoast SEO for the sitemap, Facebook Open Graph and Twitter cards. These rules fixed the SEO problems:

# Rewrites for WordPress SEO XML Sitemap
rewrite ^/sitemap_index\.xml$ /index.php?sitemap=1 last;
rewrite ^/([^/]+?)-sitemap([0-9]+)?\.xml$ /index.php?sitemap=$1&sitemap_n=$2 last;
# Rewrites for robots.txt
rewrite ^/robots\.txt$ /index.php?robots=1 last;

I’ve always liked The Setup. Discovering what technologies, hardware and software other skilled people use is extremely useful and really fun for me. This time I’d like to share some tips from the complete reboot of my personal ecosystem after switching to my new MacBook.

[Image: MacBook Pro 13″ Retina]

On the hardware side it’s a simple high-end 2015 MacBook Pro 13″ Retina with a dual-core Intel Core i7 Haswell at 3.4GHz, 16GB of RAM and a 1TB PCI Express 3.0 SSD. It’s fast, solid, lightweight and flexible. The only required accessory is the Be.eZ LArobe Second Skin.

On the software side I decided to avoid a Time Machine restore in order to set up a completely new environment. I started from a fresh OS X 10.10 Yosemite installation.

As a polyglot developer I usually deal with a lot of different applications, programming languages and tools. In order to decide what to install, a list of what I had on the previous machine and what else I needed was really useful.

Here is a list of useful software and some tips about the installation process.

Applications

[Image: paid apps]

Paid software worth having: Evernote (with a Premium subscription and Skitch) and Todoist (with a Premium subscription), both available on the Mac App Store. 1Password, Fantastical 2, OmniGraffle, Carbon Copy Cloner, Backblaze and ExpanDrive are available on their own websites.

Free software worth having: Google Chrome and Mozilla Firefox as browsers, Apache OpenOffice, Skype and Slack as chat clients, VLC for multimedia and Transmission for torrents.

[Image: apps from suites]

Suites, or parts of them: Adobe Photoshop CC, Adobe Illustrator CC and Adobe Acrobat Pro DC are part of Adobe Creative Cloud. Microsoft Word 2016 and Microsoft Excel 2016 are part of Microsoft Office 2016 for Mac (now in free preview). Apple Pages and Apple Keynote come preinstalled as part of the Apple iWork suite, as do Apple Calendar and Apple Contacts.

Development tools

Utilities for power users: Caffeine, Growl and HardwareGrowler, iStat Menus Pro, Disk Inventory X, Tor Browser and TrueCrypt 7.1a (you need to fix a little installation bug on OS X 10.10), Kitematic and Boot2Docker for Docker, Sublime Text 3 (with some additions like the Spacegray theme, the Soda theme, a new icon and the Source Code Pro font), Tower, Visual Studio Code, the Android SDK (for the Android emulator) and Xcode (for the iOS simulator), VirtualBox (with some useful Linux virtual images), and iTerm 2.

CLI: Oh My Zsh, Homebrew, GPG (installed using brew), Xcode Command Line Tools (from the Apple Developer website), Git (with git-flow, installed using brew), AWS CLI (installed via pip), PhantomJS, s3cmd and the faster s4cmd, Heroku Toolbelt and OpenShift Client Tools (installed via gem).
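Most of these boil down to a handful of package-manager one-liners; a rough sketch (package names as they were in mid-2015):

# CLI tools via Homebrew
brew install gnupg git-flow phantomjs s3cmd
# AWS tools via pip
pip install awscli s4cmd
# OpenShift client tools via RubyGems
gem install rhc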

[Image: server daemons]

Servers: MariaDB 10.0 (brew), MongoDB 3.0 (brew), Redis 3.0 (brew), Elasticsearch 1.6 (brew), Nginx 1.8.0 (brew), PostgreSQL 9.4.2 (via Postgres.app), Hadoop 2.7.0 (brew), Spark 1.4 (download from official website), Neo4j 2.2 (brew), Accumulo 1.7.0 (download from official website), Crate 0.49 (download from official website), Mesos 0.22 (download from official website), Riak 2.1.1 (brew), Storm 0.9.5 (download from official website), Zookeeper 3.4.6 (brew), Sphinx 2.2 (brew), Cassandra 2.1.5 (brew).
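Most of these daemons come straight from Homebrew, so installation is mostly a single command; for example (assuming the formula names of the time):

brew install mariadb mongodb redis elasticsearch nginx hadoop neo4j zookeeper sphinx cassandra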

[Image: programming languages]

Programming languages: RVM, Ruby (MRI 2.2, 2.1, 2.0, 1.9.3, 1.8.7, REE 2012.02 and JRuby 1.7.19, installed using RVM), PHP 5.6 with PHP-FPM (installed using brew), HHVM 3.7.2 (installed using brew after adding an additional repo; it has some issues on 10.10), Python 2.7 (brew python) and Python 3.4 (brew python3), pip 7.1 (shipped with Python), NVM, Node.js 0.12 and io.js 2.3 (both installed using NVM), Go 1.4.2 (from the Golang website), Java 8 JVM (from the Oracle website), Java 8 SE JDK (from the Oracle website), Scala 2.11 (from the Scala website), Clojure 1.6 (from the Clojure website), Erlang 17.0 (brew), Haskell GHC 7.10 (brew), Haskell Cabal 1.22 (brew), OCaml 4.02.1 (brew), R 3.2.1 (from the R for Mac OS X website), .NET Core and ASP.NET (via DNVM, using brew), GPU Ocelot (compiled with a lot of libraries).
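The version managers do the heavy lifting for the runtimes; for example, for the Ruby and Node.js versions listed above:

# Ruby via RVM
rvm install 2.2
rvm use 2.2 --default
# Node.js and io.js via NVM
nvm install 0.12
nvm install iojs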

The full reboot took about 2 days. Some software is still missing, but I was able to resume my work almost completely. I hope this list will be helpful to someone 🙂

[Image: Bedrock logo]

During the last couple of weeks I had to work on a PHP project with a custom WordPress stack I had never used before: Bedrock.

The home page says: “Bedrock is a modern WordPress stack that gets you started with the best development tools, practices, and project structure.”

What Bedrock really is

It is a regular WordPress installation with a different folder structure, integrated with Composer for dependency management and Capistrano for deployment. The structure resembles Rails and similar frameworks, but it contains the usual WordPress components and runs on the same web stack.

├── composer.json
├── config
│   ├── application.php
│   └── environments
│       ├── development.php
│       ├── staging.php
│       └── production.php
├── vendor
└── web
    ├── app
    │   ├── mu-plugins
    │   ├── plugins
    │   ├── themes
    │   └── uploads
    ├── wp-config.php
    ├── index.php
    └── wp

Server configuration

The project used to run on Apache with mod_php, but I personally don’t like that stack. I’d like to test it on HHVM but, for the moment, I preferred to run it on nginx with PHP-FPM. Starting from an empty Ubuntu 14.04 installation, I set up a LEMP stack with memcached and Redis using apt-get:

apt-get update
apt-get install build-essential tcl8.5 curl screen bootchart git mailutils munin-node vim nmap tcpdump nginx mysql-server mysql-client memcached redis-server php5-fpm php5-curl php5-mysql php5-mcrypt php5-memcache php5-redis php5-gd

Everything works fine except the Redis extension (used for a custom function unrelated to WordPress). I don’t know why, but its config file wasn’t copied into the configuration directory /etc/php5/fpm/conf.d/. You can find it among the available mods in /etc/php5/mods-available/.
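If you hit the same issue, enabling the module by hand is quick; a sketch assuming the stock Ubuntu 14.04 layout:

# enable the redis module (symlinks the ini from mods-available into conf.d)
sudo php5enmod redis
sudo service php5-fpm restart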

PHP-FPM uses a standard configuration placed in /etc/php5/fpm/pool.d/example.conf. It listens on 127.0.0.1:9000 or on a Unix socket at /var/run/php5-fpm-example.sock (I assume the configured pool name was “example”).

Memcached is configured for session sharing among multiple servers. To activate it, edit /etc/php5/fpm/php.ini and set the following parameters:

session.save_handler = memcache
session.save_path = 'tcp://192.168.0.1:11211,tcp://192.168.0.2:11211'
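After editing php.ini, restart PHP-FPM so the pool picks up the new session handler:

sudo service php5-fpm restart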

The nginx configuration is placed in /etc/nginx/sites-available/ and linked in /etc/nginx/sites-enabled/ as usual, and it forwards requests for PHP files to PHP-FPM:

server {
    listen 80 default deferred;
    root /var/www/example/htdocs/current/web/;
    index index.html index.htm index.php;
    server_name www.example.com;

    access_log /var/www/example/logs/access.log;
    error_log /var/www/example/logs/error.log;

    location / {
        try_files $uri $uri/ /index.php?q=$uri&$args;
    }

    location ~ \.php$ {
        try_files $uri =404;
        # fastcgi_pass 127.0.0.1:9000;
        fastcgi_pass unix:/var/run/php5-fpm-example.sock;
        fastcgi_index index.php;
        fastcgi_buffers 16 16k;
        fastcgi_buffer_size 32k;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}
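As usual on Debian-based systems, the vhost then gets enabled with a symlink and a reload (paths are illustrative):

# enable the vhost, validate the configuration and reload nginx
sudo ln -s /etc/nginx/sites-available/example /etc/nginx/sites-enabled/example
sudo nginx -t && sudo service nginx reload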

The root directory is Bedrock’s web/ directory, prefixed with current/ to support the Capistrano directory structure displayed below.

├── current -> /var/www/example/htdocs/releases/20150120114500/
├── releases
│   ├── 20150080072500
│   ├── 20150090083000
│   ├── 20150100093500
│   ├── 20150110104000
│   └── 20150120114500
├── repo
│   └── <VCS related data>
├── revisions.log
└── shared
    └── <linked_files and linked_dirs>

Local configuration

I’m quite familiar with Capistrano because of my recent Ruby background. You need a Ruby version greater than 1.9.3 to run it (RVM helps). The first step is to download the dependencies; Ruby uses Bundler for that.

# run it to install bundler gem the first time
gem install bundler
# run it to install dependencies
bundle install

Bundler reads the Gemfile (and Gemfile.lock) and downloads all the required gems (Ruby libraries).

Now the technological stack is ready, both locally and on the server 🙂
I’ll probably describe how to run a LEMP stack on OS X in a future post. For the moment I’m assuming you are able to run it locally. There are useful guides by Jonas Friedmann and rtCamp.

Anyway, Bedrock can run on any LAMP/LEMP stack. The only “special” feature is the Composer integration: Composer is to PHP what Bundler is to Ruby, helping developers manage dependencies in a project. Here it is used to manage plugins, themes and WordPress core updates.

You can run composer install to install the libraries. If you change the dependency configuration, or you want to force a fresh download (maybe after a clean install), run composer update.
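Since Bedrock’s default composer.json already registers the wpackagist repository, plugins can be pulled in as ordinary dependencies; a sketch with an arbitrary plugin name:

# add a plugin from wpackagist as a Composer dependency
composer require wpackagist-plugin/wp-fastest-cache
# then install from the lock file, or update to the latest allowed versions
composer install
composer update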

Deploy

Capistrano lets you set up different deploy environments. A global configuration is defined, and you only need to specify the custom configuration for each environment. An example of /config/deploy/production.rb:

set :application, 'example'
set :stage, :production
set :branch, "master"
server '192.168.0.1', user: 'user', roles: %w{web app db}

Everything else is inherited from the global config, where all the other deploy properties are defined. It’s important to note that Bedrock’s Capistrano deploy script only downloads the source code from the Git repo and runs composer install for the main project. If you need to run it for a plugin too, you need to define a custom Capistrano task and run it at the end of the deploy. For instance, you can add the following lines to the global configuration in order to install the dependencies of a specific plugin:

namespace :deploy do
  desc 'Rebuild Plugin Libraries'
  task :updateplugin do
    on roles(:app), in: :sequence, wait: 5 do
      execute "cd /var/www/#{fetch(:application)}/htdocs/current/web/app/plugins/anything/ && composer install"
    end
  end
end
after 'deploy:publishing', 'deploy:updateplugin'

Now you are ready to deploy your Bedrock install to the server!
Simply run cap production deploy, restart PHP-FPM (service php5-fpm restart) and enjoy it 😀

Many thanks to Giuseppe, a great sysadmin and friend, for his support during the development and deployment of this @#@?!?@# application.

[Image: PHPDay logo]

I used to refer to myself as a “PHP Developer” for a long time, more or less from the beginning of my career until 2011, when I moved to Ruby as my main language. The last time I worked with PHP for real, the tools were still uncomfortable and the community was huge but messy. Four years later, everything seems changed. Several important companies (including Facebook) leverage it, and both the core language and the most popular tools have improved a lot.

Things like HHVM, Hack, the PHP-FIG, Composer, PHP 7 and many more are rapidly evolving the PHP landscape, so I decided to attend PHPDay 2015 to meet the Italian community and refresh my knowledge.

It was a really interesting event. I had the opportunity to meet several awesome people and chat with them about almost everything (quite often chats happened on the grass of the beautiful location in Verona 🙂 )

[Image: PHPDay attendees]

Davey Shafik (@dshafik), a funny guy from Engine Yard, talked about PHP 7 and HHVM as major improvements to the core of PHP. Enrico Zimuel (@ezimuel), core developer of Zend Framework, talked about ZF3. Steve Maraspin (@maraspin) talked about his experience with async PHP and parallel processing. Bernhard Schussek (@webmozart), core developer of Symfony, talked about Puli, a framework-agnostic tool for resource sharing.

I strongly believe PHP has changed. It has absorbed many “good parts” from other languages and, with Facebook’s endorsement, is now ready to become a first-class language.

See you next year at PHPDay 2016 😀

A couple of weeks ago I was playing with Hack and looking for online resources. On YouTube I found the playlist from Hack Dev Day 2014, where Hack was officially presented to the world.

The introduction to the language by Julien Verlaguet is really interesting: it shows the advantages of static typing and how HHVM is able to preserve PHP’s rapid development cycle.

Josh Watzman’s talk is also interesting. He explains how to convert PHP code to Hack code, and his years of experience at Facebook are extremely valuable.

The conference also covers how to run HHVM on Heroku, gives an overview of Hack’s libraries and common use cases, and digs into HHVM’s optimizations.

If you are playing with Hack I absolutely recommend these videos.

A few hours after I posted about the DataSift architecture, @choult, one of the roughly 25 ninjas who develop the DataSift platform, tweeted me.

The following SlideShare presentation by @stuherbert, another ninja, covers the use of PHP at DataSift. Contrary to what you may think, PHP is widely used in data processing.

[Image: languages used in DataSift repositories]

The system breaks down into three major data pipelines:

  • Data Archiving (adds new data to the Historic Archive)
  • Filtering Pipeline (filters and delivers data in realtime)
  • Playback Pipeline (filters and delivers data from the Historic Archive)

And PHP is used for many parts of these.

[Image: PHP usage at DataSift]

They use a custom build of PHP 5.3.latest with several optimizations and compiled-in extensions (ZeroMQ, APC, XHProf, Redis, XDebug). They also developed some internal components:

  • Frink: TweetMeme’s framework
  • Stone: the foundation of their in-house test tools, Hornet and Storyteller (they probably open-sourced a fork named Storyplayer)

Unfortunately I wasn’t able to find more details about these. Anyway, here is the presentation:


[Image: DataSift logo]

DataSift, as they say on their home page, “aggregate, process and deliver social data”. It is one of the oldest Twitter certified partners and offers data coming from almost every existing social network. I use it every day to “listen” to the net and find the data I need for my analyses.

It’s impressive to watch how fast they collect data from external sources and deliver it to your chosen destination. When I tweet, a couple of minutes later a JSON file lands in my S3 bucket.

Creating Internet-scale filtering is not easy. Their infrastructure is really complex and optimized. This is a 2011 diagram of their workflow.

[Image: DataSift infrastructure diagram, 2011]

Twitter generates more than 500 million tweets per day, and it is only one of the available sources. The DataSift system performs 250+ million sentiment analyses with sub-100ms latency, and several TB of augmented data (including gender, sentiment, etc.) transit the platform daily. Data filtering nodes can process up to 10,000 unique streams and can do lookups against 10,000,000+ username lists in real time. Link augmentation performs 27 million link resolves and lookups, plus 15+ million full web page aggregations, per day.

C++ is used for the performance-critical components, like the core filtering engine, and PHP for the site, the external API server, most of the internal web services, and a custom-built, high-performance job queue manager. Java/Scala are used for batch processing with HBase and MapReduce jobs. Kafka is used as the queuing system, and Ruby for deploys and provisioning. Thrift is widely used.

MySQL (Percona Server) on SSD drives is used as the primary storage, an HBase cluster over more than 30 Hadoop nodes provides a place to store historical data, and Memcached and Redis are used for caching.

Here is a schema of the processing unit that builds the historical database.

[Image: DataSift historical processing unit]

Message queues are another critical component of the infrastructure. 0mq (a custom build from the latest alpha branch, with some stability fixes, chosen for its publisher-side filtering) is used in different configurations:

  • PUB-SUB for replication / message broadcasting;
  • PUSH-PULL for round-robin workload distribution;
  • REQ-REP for health checks of different components.

Kafka is used for high-performance persistent queues. In both cases they’re working with the developers, contributing bug reports, traces, fixes and client libraries.

All code is pulled from the repo by Jenkins every 5 minutes, automatically tested and verified with several QA tools, packaged as an RPM and moved to the dev package repo. Chef is used to automate deployments and manage configuration. All services emit StatsD events, which are combined with other system-level checks, added to Zenoss and displayed with Graphite.

The biggest challenge IMHO is filtering. Filtering at this scale requires a different approach. They started with work they did at TweetMeme. The core filter engine is in C++ and is called the Pickle Matrix. Over three years they’ve developed a compiler and their own virtual machine. We don’t know what their technology is exactly, but it might be something like Distributed Complex Event Processing with Query Rewriting.

Sources

Almost all the content of this post comes from the wonderful article “DataSift Architecture: Realtime Datamining At 120,000 Tweets Per Second” posted on HighScalability. Some details also come from “Historical Architecture – Data Mining Billions of Tweets” on the DataSift blog.

[Image: VKontakte logo]

Information about VKontakte, the largest European social network, and its infrastructure is scarce and fragmented. The only recent insight, in English, into its technology is a press release by BTI about VK’s migration onto their infrastructure. Everything else is top secret.

Only in 2011, at HighLoad++ in Moscow, did Pavel Durov and Oleg Illarionov tell something about the architecture of the social network; the insights are collected in this post (in Russian).

VK doesn’t seem different from any other popular social network: it runs on a LAMP stack and uses many other open source technologies.

  • Debian is the base for their custom Linux distro.
  • nginx manages load balancing in front of Apache, which runs PHP using mod_php with XCache as the opcode cache.
  • MySQL is the main datastore, but a custom DBMS (written in C and based on the memcached protocol) is used for some magic. memcached also helps with page caching.
  • XMPP is used for messages and chats and runs over node.js. Availability is guaranteed by HAProxy, which handles node fragility.
  • Multimedia files are stored using XFS, and media encoding is done with ffmpeg.
  • Everything is distributed over more than 4 datacenters.

[Image: VK logo]

The main difference between VK and other social networks concerns server roles: VK servers are multifunctional. There is no clear distinction between database servers and file servers; they are used simultaneously in several roles.

Load balancing between servers occurs in a layered scheme that includes DNS-level balancing as well as request routing within the system, where different servers are used for different types of requests.

For example, microblogging works through a tricky scheme that exploits the memcached protocol’s ability to send parallel requests for data on a large number of keys. If the data is not in the cache, the same request is sent to the storage system, and the results are sorted, filtered and pruned of excess at the PHP code level.

The custom database is still a secret and is widely used at VKontakte. Many services use it: private messages, messages on walls, statuses, search, privacy, friend lists and probably more. It uses a non-relational data model, and most operations are performed in memory. The access interface is an extended memcached protocol: specially crafted keys return the results of complex queries. They say it was developed by the “best minds” of Russia.

I wasn’t able to find any other insights about the VK infrastructure after this speech. They are like the KGB 😀

From the home page

“Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.”

Introduction

Storm enables you to define a Topology (an abstraction of a cluster computation) in order to describe how to handle data flows. In a topology you can define Spouts (entry points for your data, with basic preprocessing) and Bolts (single steps of data manipulation). This simple strategy enables you to define complex processing of data streams.

Storm nodes are of two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Worker nodes run the Supervisor, which listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Coordination between them is done through ZooKeeper.
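Once Nimbus is running, topologies are submitted and managed through the storm CLI; a minimal sketch (jar, class and topology names are illustrative):

# submit a packaged topology to the cluster
storm jar mytopology.jar com.example.MyTopology my-topology
# list running topologies and kill one when done
storm list
storm kill my-topology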


For about half a year I’ve wanted to move my blog away from Heroku. It’s the best PaaS I’ve ever used, but the free plan has a huge limitation: dynos idle. In a previous post I talked about how to use Heroku to build a reverse proxy in front of AppFog to get around their custom domain limit, but the idle problem is still there. My blog gets fewer than 100 visits per day, and almost every visitor has to wait 5-10 seconds for the home page because the dynos are always idle.

[Image: OpenShift logo]

Today I decided to move to another platform, suggested by my friend @dani_viga: OpenShift. It’s a PaaS similar to Heroku which uses Git for revision control and has a similar scaling system. And the free plan doesn’t have the idle problem and is 10 times faster!

I created a new application using the following cartridges: PHP 5.3, MySQL 5.1 (I’d have liked to use MariaDB, but the cartridge is still in development and I couldn’t install it) and phpMyAdmin 3.4. Applications are managed through a Git repo, and a WordPress template is provided to start from. I used it as a template, moving my blog’s code into the /php directory.

The hard part was migrating my PostgreSQL database to the new MySQL one. To start, I removed the PG4WP plugin by following its installation instructions in reverse order.

Then I exported my PostgreSQL database using the heroku db:pull command. It’s based on taps and is really useful. I had some problems with my local MySQL installation because taps has no options for packet size and character set, so you must set them as defaults. I added a few lines to my my.cnf configuration:

# enlarged, before was 1M
max_allowed_packet = 10M
# default to utf-8
skip-character-set-client-handshake
character_set_client=utf8
character_set_server=utf8
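With those defaults in place, the pull itself is a single command; a hedged sketch, where the connection string and app name are illustrative:

# pull the remote Heroku database into a local MySQL database via taps
heroku db:pull mysql://root:secret@localhost/myblog --app myblog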

At the end of the pull, my local database contained an exact copy of the Heroku one, and I could dump it to a SQL file and import it into the new MySQL cartridge using phpMyAdmin.

The only problem I had was with the SSL certificate. The free plan doesn’t offer SSL certificates for custom domains, so I had to stop using HTTPS for the login. You can do it in wp-config.php by setting:

define('FORCE_SSL_ADMIN', false);

Now my blog runs on OpenShift, and so far it seems incredibly faster 😀