T1.1 Low-cost infrastructure

Task Title: Low-cost infrastructure for InfoInternet
WP: DigI:WP-I1
Lead partner: Basic Internet Foundation
Leader:
Contributors: BasicInternet

Objective

This task will establish the architecture for low-cost access, including:

  • cost calculation for TZ and Congo


Deliverables in T1.1 Low-cost infrastructure



Equipment supplier

see DigI:TI1.2 for pilot installations


Content filtering

by Iñaki Garitano (12Sep2018)
Basic understanding of InfoInternet standard:

  • Text & pictures: allowed
  • Streaming, games, high-bandwidth content: blocked

The way to filter is known from the security industry, among others from Palo Alto Networks. However, their solutions focus on security, not on the low-cost provision of information.

Required:

  • Roadmap to reach the InfoInternet standard
  • Today: whitelist, blacklist, content metadata
  • Tomorrow: automatic analysis (either real-time or off-line)
  • Final InfoInternet standard: Public Database supporting local filtering

Methods

  • Decentralized = each Mikrotik performs (part of) the filtering itself.
  • Centralized = all traffic (at least the unauthenticated one) goes through the Basic Internet core.

Methods, ordered by centralized/decentralized filtering plus difficulty/time to implement:

1.- Decentralized filtering

  • 1.1.- Whitelist
  • 1.2.- Blacklist of already known Content Delivery Network (CDN) addresses (Akamai, Cloudflare, CloudFront, Wowza, IBM Cloud Video, Livestream, DaCast, etc.)
  • 1.3.- RouterOS L7 filter

2.- Semi centralized/decentralized - some actions have to be done in the core while others in the Mikrotiks

  • 2.1.- Web crawler to analyze requested web pages and populate the blacklists

3.- Centralized filtering

  • 3.1.- Commercial proxy/firewall to filter by Content-Type
  • 3.2.- Open-Source proxy/firewall to filter by Content-Type

4.- Needs more research, because it could perhaps be done decentralized

  • 4.1.- Traffic pattern based connection filtering

PROS & CONS

1.- Decentralized filtering

  • Cons:
  • Mikrotik devices have to be updated with new configurations.
  • Some performance overhead may occur. It would be interesting to measure it somehow.
  • The overhead is very small: with a 25-rule firewall we can easily handle ~70 Mbit/s (about 60 GByte/day) on an RB960 (RB952, rated at max 900 Mbit/s) - but only as long as traffic is just tagged and measured (see the sketch below).
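
As an illustration of the "tagged and measured" setup, a minimal RouterOS sketch could look as follows. The list name info-whitelist and the mark names are assumptions for illustration, not the deployed configuration:

    /ip firewall mangle
    add chain=forward dst-address-list=info-whitelist \
        action=mark-connection new-connection-mark=info-conn passthrough=yes \
        comment="tag InfoInternet traffic"
    add chain=forward connection-mark=info-conn \
        action=mark-packet new-packet-mark=info-pkt passthrough=no

    # the per-rule byte/packet counters give the measured volume
    /ip firewall mangle print stats

Marking without content inspection keeps the per-packet cost low, which is consistent with the ~70 Mbit/s figure above.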

2.- Semi centralized/decentralized - some actions have to be done in the core while others in the Mikrotiks

  • Cons:
  • Core infrastructure needs to be prepared. Note: the core infrastructure is in place; the whitelist is centrally located (ownCloud) and populated to the LNCC.
  • Mikrotik devices have to be updated with new configurations.
  • Some performance overhead may occur. It would be interesting to measure it somehow.

3.- Centralized filtering

  • Cons:
  • All traffic needs to go through a centralized device.
  • Note: applicable to some traffic, but not suitable for all traffic, as the backbone traffic is the main cost (besides other topics such as virus filtering, international traffic, ...).

1.1.- Whitelist

Whitelisting is the closed-world approach, which is easier to manage.

Pros:

  • The easiest one to implement.
  • Allows reducing most of the traffic.

Cons:

  • The most restrictive one.
  • Requires to analyze the content of each web page.
  • Dynamically generated web pages such as Facebook have to be blocked, because it is not possible to analyze their content beforehand.
  • Not completely right: Facebook uses dedicated video servers, which can be blocked separately.
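
As a minimal sketch, such a whitelist could be expressed on the Mikrotik as below. The list name info-whitelist and the two domains are examples only; a real deployment would scope the drop rule to unauthenticated users:

    /ip firewall address-list
    add list=info-whitelist address=wikipedia.org comment="example entry"
    add list=info-whitelist address=its-wiki.no comment="example entry"

    /ip firewall filter
    add chain=forward dst-address-list=info-whitelist action=accept
    add chain=forward action=drop comment="default deny (closed world)"

Recent RouterOS versions resolve FQDN entries in address lists dynamically, so the whitelist can be maintained and distributed as domain names.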

1.2.- Blacklist of already known Content Delivery Network (CDN) addresses

Blacklisting is the open-world approach. A potential starting point is to take the top 500 web pages (national, international, ...) and analyse them in depth: which web pages are they calling ("all levels below")? This should give an overview of 90%(?) of the traffic. Strategy: measure upcoming new web sites and their traffic; if the traffic exceeds xxx MB, analyse them. It might be necessary to tag the "known" web pages and then measure the traffic of the "not-known" web pages.

Pros:

  • The second easiest one to implement.
  • Allows reducing the traffic of well-known CDNs.

Cons:

  • Video/audio content delivered through unknown CDNs or other addresses is not filtered.
  • Requires analyzing and updating the addresses of CDNs all the time.
  • Hard to catch new CDNs.

Conclusion: worth a trial in India (or Kinderdorf)

  • Both black- and whitelisting will increase the number of addresses, and will at some point hit the capacity of the Mikrotik equipment.
  • 20,000 addresses are okay for the RB952.
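
Analogously to the whitelist, a hedged sketch of a CDN blacklist; the list name and the two domains are illustrative, and the real list would hold the address ranges of Akamai, Cloudflare, CloudFront, etc.:

    /ip firewall address-list
    add list=cdn-blacklist address=cloudfront.net comment="CDN - example"
    add list=cdn-blacklist address=akamai.net comment="CDN - example"

    /ip firewall filter
    add chain=forward dst-address-list=cdn-blacklist action=drop \
        comment="block known CDN (streaming) traffic"

Since both lists grow over time, the ~20,000-address capacity noted above is the figure to watch.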

1.3.- RouterOS L7 filter

Pros:

Cons:

  • Only unencrypted HTTP can be matched, NOT HTTPS; for example, filtering YouTube does not work.
  • Not 100% reliable.

Conclusions:

  • As more and more traffic moves to HTTPS, it is not advisable to use L7 filtering.
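
For reference, and for the testing foreseen in the implementation plan below, an L7 rule has this shape. The pattern is a hypothetical example aimed at unencrypted HTTP video responses, not a validated filter:

    /ip firewall layer7-protocol
    add name=http-video regexp="^.*(content-type: video).*\$"

    /ip firewall filter
    add chain=forward protocol=tcp layer7-protocol=http-video action=drop

The matcher inspects only the first packets of each connection, and for HTTPS it sees only encrypted payload - which is exactly why it fails for e.g. YouTube.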

2.1.- Web crawler to analyze requested web pages and populate the blacklists

Pros:

  • Could be combined with 1.1, 1.2 and 1.3.

Cons:

  • Dynamically generated web pages, such as login-based pages, cannot be partially filtered.
  • Not 100% reliable.
  • Requires many resources to analyze web pages.

Conclusion:

  • Establish a cloud infrastructure for the filtering (see the DNS-export sketch below).
  • For the "login/HTTPS" pages, use the blacklist approach.
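
A minimal sketch of the router-side part: exporting DNS queries to a central (cloud) collector for later crawling and analysis. The collector address 192.0.2.10 is a placeholder:

    /system logging action
    add name=dns-export target=remote remote=192.0.2.10 remote-port=514

    /system logging
    add topics=dns action=dns-export

This assumes the clients use the Mikrotik as their DNS resolver; the central side (the crawler, e.g. Apache Nutch, plus list generation) is the part that remains to be developed.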

3.1. Commercial proxy/firewall to filter by Content-Type

A typical example of such an approach is the security filtering by Palo Alto Networks.

Pros:

  • Easy to implement. Note: you buy a device which performs the filtering.
  • Able to filter even HTTPS connections.

Cons:

  • All traffic needs to be centralized.
  • Price.
  • Need to perform a man in the middle for HTTPS connections.
  • Even a paid solution will most probably not block 100% of the undesired traffic.

3.2.- Open-Source proxy/firewall to filter by Content-Type

Examples:

Pros:

  • Cheap.

Cons:

  • All traffic needs to be centralized.
  • Need to perform a man in the middle for HTTPS connections.

Conclusions:

  • Needs further work from our side (a redirect sketch follows below).
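
For that further work, the Mikrotik side of a centralized proxy setup is a plain destination NAT. The proxy address 192.0.2.20, port 3128 and the source list are placeholder assumptions:

    /ip firewall nat
    add chain=dstnat src-address-list=unauthenticated protocol=tcp dst-port=80 \
        action=dst-nat to-addresses=192.0.2.20 to-ports=3128 \
        comment="send plain HTTP via the central proxy"

HTTPS cannot simply be redirected this way; as listed under Cons, it requires a man-in-the-middle setup on the proxy itself.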

4.1.- Traffic pattern based connection filtering

Research topic

Pros:

  • Works for both HTTP and HTTPS.

Cons:

  • Traffic patterns need to be generated for different content types, bandwidths, etc.
  • Final implementation on Mikrotiks needs to be analyzed.
  • If not possible, all traffic would need to be centralized.

IMPLEMENTATION PLAN

1.1.- Whitelist

  • Done.

1.2.- Blacklist of already known Content Delivery Network (CDN) addresses (Akamai, Cloudflare, CloudFront, Wowza, IBM Cloud Video, Livestream, DaCast, etc.)

  • Done in principle.
  • Not yet automated for newly appearing CDN servers.

1.3.- RouterOS L7 filter

  • Needs student work / industrial development to evaluate different filters and check how they perform.
  • Filter updating scripts would need to be generated.
  • Mikrotik performance impact would have to be measured.
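
For the performance measurement, RouterOS has built-in tools that could serve as a starting point; a sketch, to be run while test traffic passes the filters:

    # snapshot of CPU load and free memory
    /system resource print

    # per-process CPU usage (firewall/networking vs. management)
    /tool profile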

2.1.- Web crawler to analyze requested web pages and populate the blacklists.

Topics to do/outsource:

  • Different crawlers such as Apache Nutch have to be analyzed.
  • Scripts to get DNS requests for later analysis have to be developed.
  • Needs a support team for long-term sustainability.

3.1.- Commercial proxy/firewall to filter by Content-Type

  • Topology needs to be changed to centralize all traffic or at least the unauthenticated one.
  • Device needs to be configured.
  • Con: probably not scalable to fit the low-cost market which we address.

3.2.- Open-Source proxy/firewall to filter by Content-Type

  • Different proxy/firewall solutions have to be analyzed to select those performing well.
  • Topology needs to be changed to centralize all traffic or at least the unauthenticated one.
  • Device needs to be configured.
  • Should be combined with traffic-pattern analysis, e.g. to detect video buffering and then reduce the speed to those sites. Could go together with a Mikrotik rule to limit traffic to e.g. 500 kbit/s (as sketched below).
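
A sketch of such a throttling rule as a simple queue; the client subnet and the destination network are stand-ins for the addresses a pattern detector would supply:

    /queue simple
    add name=limit-video target=192.168.88.0/24 dst=198.51.100.0/24 \
        max-limit=500k/500k comment="throttle suspected streaming sources"

Throttling instead of blocking degrades video gracefully while leaving text and pictures usable.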

4.1.- Traffic pattern based connection filtering

  • This will require a bachelor or master thesis to analyze traffic patterns and create a lightweight content based filter.
  • Analyze if it is possible to implement the content filter on the Mikrotiks.

Conclusion

Now:

  • continue with the whitelist filtering,
  • develop the web analysis (2.1 - which web pages are called),
  • analyse the web pages for content (3.2 filtering plus 4.1 traffic patterns, possibly with Mikrotik throughput reduction).


Medium term:

  • test the blacklist approach, starting with the top 500++ web pages (2.1) and the pages they call
  • add the L7 filtering approach in Mikrotik (needs to be implemented and tested; did not work out at Kjeller)

Longer term:

  • combined analysis of 1.2 blacklist, 1.3, plus 3.2 and 4.1

New ideas

  • QR code scanning for WiFi access code
  • QR code for voucher access; alternative: SMS

Cost calculation

Calculations of costs, using Tanzania (TZ) as example (ownCloud, confidential): https://owncloud.unik.no/index.php/apps/files/ajax/download.php?dir=%2F1-Projects%2FBasicInternet%2FTechnology%2FCost-Infrastructure&files=Infra_cost_Template_Tz.xlsx